Re: [Nutch-general] Querying Fields

Lukas Vlcek Fri, 11 Aug 2006 02:27:39 -0700

Hi,

You need to look into the source to find out exactly what it does. As far
as I know, it does not add any new field to the index (that should be done
by the index-more plugin), but it allows you to query using type:, date:,
and site:, I think.
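
A minimal sketch of the query shape described in this thread, assuming the
two rules mentioned further down: a type: or date: filter goes together
with a base clause such as url:http or a content term, and there is no
space between the filter name and its value. The helpers build_query and
has_detached_filter are hypothetical names, not part of Nutch; they only
assemble and check the string you would type into the search box.

```python
import re

# Filter names taken from the posts in this thread. The helpers below are
# hypothetical illustrations and do not call any Nutch API.
FILTER_FIELDS = ("type", "date")


def build_query(base, **filters):
    """Join a base clause with field filters written as 'name:value'."""
    if not base or not base.strip():
        raise ValueError("a base clause such as 'url:http' or a term is required")
    parts = [base.strip()]
    for name, value in filters.items():
        if name not in FILTER_FIELDS:
            raise ValueError(f"unknown filter field: {name}")
        value = str(value).strip()
        if not value or " " in value:
            raise ValueError(f"value for {name} must be a single token")
        # No space between the filter name and its value.
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def has_detached_filter(query):
    """Return True if a filter name is separated from its value by whitespace."""
    pattern = r"\b(?:%s):\s+" % "|".join(FILTER_FIELDS)
    return re.search(pattern, query) is not None


print(build_query("url:http", type="pdf"))
print(has_detached_filter("date: 20060801 test"))
print(has_detached_filter("url:http date:20060801"))
```

The second check flags the form [date: 20060801], which the thread reports
as not equivalent to [date:20060801].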


Lukas
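
The plugin.includes value quoted further down in this thread is a regular
expression over plugin directory names. As a rough way to check which
plugins such a value enables, here is a small sketch. It assumes that a
plugin is included only when its whole directory name matches the
expression; is_included is a hypothetical helper, not a Nutch function.

```python
import re

# plugin.includes value copied from the nutch-site.xml quoted in this thread.
PLUGIN_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|"
    "index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic"
)


def is_included(plugin_name, includes=PLUGIN_INCLUDES):
    """Assumption: the whole plugin directory name must match the expression."""
    return re.fullmatch(includes, plugin_name) is not None


# Print which of a few plugin names the expression above would enable.
for name in ("index-more", "query-more", "parse-pdf", "query-stemmer"):
    print(name, is_included(name))
```

Running a check like this against your own value is a quick way to confirm
that both index-more and query-more are actually enabled before testing
type: or date: queries.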

On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> What exactly does the query-more plugin do? I tested it a few minutes ago and
> it doesn't add any field to the resulting index. Is it used in the webapp? Could
> you clarify this for me?
>
> Thanks!
>
> On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > If my memory serves me correctly, query-more should work fine with
> > Nutch 0.7.2 too.
> > And you are right, Matthew: you need to use the [type:] or [date:]
> > filters in combination with [url:], because you can get an empty result
> > set if they are used alone. I run queries like this: [url:http type:pdf]
> > and it gives me the results I need.
> >
> > Lukas
> >
> > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > All right! I've done this already. I think you don't understand my
> > > question.
> > > What I want to do is query my indexes using something like
> > > "filetype:pdf". Version 0.8 already has this feature, but I'm using
> > > version 0.7.2 and I want to add this feature manually. I don't know
> > > where I have to edit, though. Do you know?
> > >
> > > Regards,
> > >
> > > Lourival Junior
> > >
> > > On 8/9/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > To allow more formats to be indexed, you need to modify nutch-site.xml
> > > > and update or add the plugin.includes property (see nutch-default.xml for
> > > > the default settings). The following is what I have in nutch-site.xml:
> > > >
> > > > <property>
> > > >   <name>plugin.includes</name>
> > > >   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
> > > > </property>
> > > >
> > > > [parse-*] is used to parse various formats; [query-more] allows you to
> > > > use the [type:] filter in Nutch queries.
> > > >
> > > > Regards,
> > > > Lukas
> > > >
> > > > On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> > > > > Hi Lukas and everybody!
> > > > >
> > > > > Do you know which file in Nutch 0.7.2 I should edit to add a field
> > > > > to my index (e.g. file type: PDF, Word or HTML)?
> > > > >
> > > > > On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am not sure if I can give you any useful hint, but the following
> > > > > > is what once worked for me.
> > > > > > Example of query: url:http date:20060801
> > > > > >
> > > > > > The date: and type: options can be used in combination with url:
> > > > > > The filter url:http should select all documents (unless you allowed
> > > > > > the file or ftp protocols). A plain date or type filter selects
> > > > > > nothing if it is used alone.
> > > > > >
> > > > > > And be sure you don't introduce any space between the filter name and
> > > > > > its value ([date: 20060801] is not the same as [date:20060801]).
> > > > > >
> > > > > > Lukas
> > > > > >
> > > > > > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
> > > > > > > Howie,
> > > > > > >    I inspected my index using Luke and 20060801 shows up several times
> > > > > > > in the index. I'm unable to query pretty much any field. Several people
> > > > > > > seem to be having the same problem. Does anyone know what's going on?
> > > > > > >
> > > > > > > This is one of the last things I have to resolve to have Nutch deployed
> > > > > > > successfully at my organization. Unfortunately, Friday is my last day.
> > > > > > > Can anyone offer any assistance??
> > > > > > > Thanks,
> > > > > > >   Matt
> > > > > > >
> > > > > > > Howie Wang wrote:
> > > > > > > > I think that I have problems querying for numbers and
> > > > > > > > words with digits in them. Now that I think of it, is it
> > > > > > > > possible it has something to do with the stemming in
> > > > > > > > either the query filter or indexing? In either case, I would
> > > > > > > > print out the text that is being indexed and the phrases
> > > > > > > > added to the query. You could also use Luke to inspect
> > > > > > > > your index and see whether 20060801 shows up anywhere.
> > > > > > > >
> > > > > > > > Howie
> > > > > > > >
> > > > > > > >> I tried looking for a page that had the date 20060801 and the text
> > > > > > > >> "test" in the page. I tried the following:
> > > > > > > >>
> > > > > > > >> date: 20060801 test
> > > > > > > >>
> > > > > > > >> and
> > > > > > > >>
> > > > > > > >> date 20060721-20060803 test
> > > > > > > >>
> > > > > > > >> Neither worked, any ideas??
> > > > > > > >>
> > > > > > > >> Matt
> > > > > > > >>
> > > > > > > >> Matthew Holt wrote:
> > > > > > > >>> Thanks Jake,
> > > > > > > >>>   However, it seems to me that it makes most sense that a query
> > > > > > > >>> should return all pages that match the query, instead of acting as a
> > > > > > > >>> content filter. However, I know it's something easy to suggest when
> > > > > > > >>> you're not having to implement it, so just a suggestion.
> > > > > > > >>>
> > > > > > > >>> Matt
> > > > > > > >>>
> > > > > > > >>> Vanderdray, Jacob wrote:
> > > > > > > >>>> Try querying with both the date and something you'd expect to find
> > > > > > > >>>> in the content.  The field query filter is just a filter.  It only
> > > > > > > >>>> restricts your results to things that match the basic query and
> > > > > > > >>>> have the contents you require in the field.  So if you query for
> > > > > > > >>>> "date:2006080 text" you'll be searching for documents that contain
> > > > > > > >>>> "text" in one of the default query fields and have the value 2006080
> > > > > > > >>>> in the date field.  Leaving out text in that example would
> > > > > > > >>>> essentially be asking for nothing in the default fields and 2006080
> > > > > > > >>>> in the date field, which is why it doesn't return any results.
> > > > > > > >>>>
> > > > > > > >>>> Hope that helps,
> > > > > > > >>>> Jake.
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> -----Original Message-----
> > > > > > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
> > > > > > > >>>> Sent: Wed 8/2/2006 4:58 PM
> > > > > > > >>>> To: [email protected]
> > > > > > > >>>> Subject: Querying Fields
> > > > > > > >>>>  I am unable to query fields in my index using the method that has
> > > > > > > >>>> been suggested. I used Luke to examine my index and the following
> > > > > > > >>>> fields exist:
> > > > > > > >>>> anchor, boost, content, contentLength, date, digest, host,
> > > > > > > >>>> lastModified, primaryType, segment, site, subType, title, type, url
> > > > > > > >>>>
> > > > > > > >>>> However, when I do a search using one of the fields, followed by a
> > > > > > > >>>> colon, an incorrect result is returned. I used Luke to find the top
> > > > > > > >>>> term in the date field, which is '20060801'. I then searched using
> > > > > > > >>>> the following query:
> > > > > > > >>>> date: 20060801
> > > > > > > >>>>
> > > > > > > >>>> Unfortunately, nothing was returned. The correct plugins are
> > > > > > > >>>> enabled; here is an excerpt from my nutch-site.xml:
> > > > > > > >>>>
> > > > > > > >>>> <property>
> > > > > > > >>>>   <name>plugin.includes</name>
> > > > > > > >>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>   <description>Regular expression naming plugin directory names to
> > > > > > > >>>>   include.  Any plugin not matching this expression is excluded.
> > > > > > > >>>>   In any case you need at least include the nutch-extensionpoints
> > > > > > > >>>>   plugin. By default Nutch includes crawling just HTML and plain text
> > > > > > > >>>>   via HTTP, and basic indexing and search plugins.
> > > > > > > >>>>   </description>
> > > > > > > >>>> </property>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> Any ideas? I'm not the only one having this problem; I saw an
> > > > > > > >>>> earlier mailing list post but couldn't find any resolution...
> > > > > > > >>>> Thanks,
> > > > > > > >>>>
> > > > > > > >>>>    Matt
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Lourival Junior
> > > > > Universidade Federal do Pará
> > > > > Curso de Bacharelado em Sistemas de Informação
> > > > > http://www.ufpa.br/cbsi
> > > > > Msn: [EMAIL PROTECTED]
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Lourival Junior
> > > Universidade Federal do Pará
> > > Curso de Bacharelado em Sistemas de Informação
> > > http://www.ufpa.br/cbsi
> > > Msn: [EMAIL PROTECTED]
> > >
> > >
> >
>
>
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [EMAIL PROTECTED]
>
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general