Re: [Nutch-general] Querying Fields

Matthew Holt Wed, 09 Aug 2006 08:39:53 -0700

Ignore the last email. I ended up doing the same as Benjamin Higgins. 
Works great, use his email for reference if you are trying to accomplish 
the same thing.
Matt


Matthew Holt wrote:
> Thanks for the reply. I've added the plugins you suggested. However, 
> some of the plugins need to be modified to search for fields such as 
> date (see previous email from Benjamin Higgins). I am currently 
> modifying the query-basic DateQueryFilter.java so one is allowed to 
> add query.date.boost to the nutch-site.xml to enable the date field 
> search.
>
> I'll try and post my results, or commit them.
> Matt
>
> Lukas Vlcek wrote:
>> Hi,
>>
>> To allow more formats to be indexed you need to modify nutch-site.xml
>> and update/add plugin.includes property (see nutch-default.xml for
>> default settings). The following is what I have in nutch-site.xml:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic</value>
>> </property>
>>
>> [parse-*] is used to parse various formats, [query-more] allows you to
>> use [type:] filter in nutch queries.
>>
>> Regards,
>> Lukas
>>
>> On 8/9/06, Lourival Júnior <[EMAIL PROTECTED]> wrote:
>>> Hi Lukas and everybody!
>>>
>>> Do you know which file in Nutch 0.7.2 I should edit to add a field to
>>> my index (e.g. file type: PDF, Word or HTML)?
>>>
>>> On 8/8/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I am not sure if I can give you any useful hint, but the following is
>>> > what once worked for me.
>>> > Example of query: url:http date:20060801
>>> >
>>> > The date: and type: options can be used in combination with url:
>>> > The filter url:http should select all documents (unless you allowed the
>>> > file or ftp protocols). A plain date or type filter selects nothing if
>>> > it is used alone.
>>> >
>>> > And be sure you don't introduce any space between filter name and its
>>> > value ([date: 20060801] is not the same as [date:20060801])
>>> >
>>> > Lukas
>>> >
>>> > On 8/8/06, Matthew Holt <[EMAIL PROTECTED]> wrote:
>>> > > Howie,
>>> > >    I inspected my index using Luke and 20060801 shows up several times
>>> > > in the index. I'm unable to query pretty much any field. Several people
>>> > > seem to be having the same problem. Does anyone know what's going on?
>>> > >
>>> > > This is one of the last things I have to resolve to have Nutch deployed
>>> > > successfully at my organization. Unfortunately, Friday is my last day.
>>> > > Can anyone offer any assistance?
>>> > > Thanks,
>>> > >   Matt
>>> > >
>>> > > Howie Wang wrote:
>>> > > > I think that I have problems querying for numbers and
>>> > > > words with digits in them. Now that I think of it, is it
>>> > > > possible it has something to do with the stemming in
>>> > > > either the query filter or indexing? In either case, I would
>>> > > > print out the text that is being indexed and the phrases
>>> > > > added to the query. You could also use Luke to inspect
>>> > > > your index and see whether 20060801 shows up anywhere.
>>> > > >
>>> > > > Howie
>>> > > >
>>> > > >> I tried looking for a page that had the date 20060801 and the text
>>> > > >> "test" in the page. I tried the following:
>>> > > >>
>>> > > >> date: 20060801 test
>>> > > >>
>>> > > >> and
>>> > > >>
>>> > > >> date 20060721-20060803 test
>>> > > >>
>>> > > >> Neither worked, any ideas??
>>> > > >>
>>> > > >> Matt
>>> > > >>
>>> > > >> Matthew Holt wrote:
>>> > > >>> Thanks Jake,
>>> > > >>>   However, it seems to me that it makes most sense that a query
>>> > > >>> should return all pages that match the query, instead of acting as a
>>> > > >>> content filter. However, I know it's something easy to suggest when
>>> > > >>> you're not having to implement it, so just a suggestion.
>>> > > >>>
>>> > > >>> Matt
>>> > > >>>
>>> > > >>> Vanderdray, Jacob wrote:
>>> > > >>>> Try querying with both the date and something you'd expect to find
>>> > > >>>> in the content.  The field query filter is just a filter.  It only
>>> > > >>>> restricts your results to things that match the basic query and have
>>> > > >>>> the contents you require in the field.  So if you query for
>>> > > >>>> "date:2006080 text" you'll be searching for documents that contain
>>> > > >>>> "text" in one of the default query fields and have the value 2006080
>>> > > >>>> in the date field.  Leaving out text in that example would
>>> > > >>>> essentially be asking for nothing in the default fields and 2006080
>>> > > >>>> in the date field, which is why it doesn't return any results.
>>> > > >>>>
>>> > > >>>> Hope that helps,
>>> > > >>>> Jake.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> -----Original Message-----
>>> > > >>>> From: Matthew Holt [mailto:[EMAIL PROTECTED]
>>> > > >>>> Sent: Wed 8/2/2006 4:58 PM
>>> > > >>>> To: [email protected]
>>> > > >>>> Subject: Querying Fields
>>> > > >>>>  I am unable to query fields in my index in the method that has
>>> > > >>>> been suggested. I used Luke to examine my index and the following
>>> > > >>>> field types exist:
>>> > > >>>> anchor, boost, content, contentLength, date, digest, host,
>>> > > >>>> lastModified, primaryType, segment, site, subType, title, type, url
>>> > > >>>>
>>> > > >>>> However, when I do a search using one of the fields, followed by a
>>> > > >>>> colon, an incorrect result is returned. I used Luke to find the top
>>> > > >>>> term in the date field which is '20060801'. I then searched using
>>> > > >>>> the following query:
>>> > > >>>> date: 20060801
>>> > > >>>>
>>> > > >>>> Unfortunately, nothing was returned. The correct plugins are
>>> > > >>>> enabled, here is an excerpt from my nutch-site.xml:
>>> > > >>>>
>>> > > >>>> <property>
>>> > > >>>>   <name>plugin.includes</name>
>>> > > >>>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)|index-(basic|more)|query-(more|site|stemmer|url)|summary-basic|scoring-opic</value>
>>> > > >>>>   <description>Regular expression naming plugin directory names to
>>> > > >>>>   include.  Any plugin not matching this expression is excluded.
>>> > > >>>>   In any case you need at least include the nutch-extensionpoints
>>> > > >>>>   plugin. By default Nutch includes crawling just HTML and plain
>>> > > >>>>   text via HTTP, and basic indexing and search plugins.
>>> > > >>>>   </description>
>>> > > >>>> </property>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> Any ideas? I'm not the only one having this problem; I saw an
>>> > > >>>> earlier mailing list post but couldn't find any resolution. Thanks,
>>> > > >>>>
>>> > > >>>>    Matt
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>>
>>>
>>> -- 
>>> Lourival Junior
>>> Universidade Federal do Pará
>>> Curso de Bacharelado em Sistemas de Informação
>>> http://www.ufpa.br/cbsi
>>> Msn: [EMAIL PROTECTED]
>>>
>>>
>>
>
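
For anyone reading this thread later, here is a rough sketch of why Lukas's advice about the space matters. It is only a toy model, not Nutch's actual query parser: the class FieldQuerySketch and its parse method are made up for illustration, and it assumes the query is split on whitespace and that a token counts as a field clause only when it has text on both sides of the colon.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Nutch's query parser.
class FieldQuerySketch {
    // Clauses of the form field=value, e.g. "date=20060801".
    final List<String> fieldClauses = new ArrayList<>();
    // Everything else, which this model treats as ordinary search terms.
    final List<String> defaultTerms = new ArrayList<>();

    // Assumption: tokens are separated by whitespace, and a token is a field
    // clause only if it has a non-empty name and a non-empty value.
    static FieldQuerySketch parse(String query) {
        FieldQuerySketch q = new FieldQuerySketch();
        for (String token : query.trim().split("\\s+")) {
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                q.fieldClauses.add(token.substring(0, colon) + "=" + token.substring(colon + 1));
            } else if (!token.isEmpty()) {
                // "date:" with nothing after the colon carries no value here.
                q.defaultTerms.add(token);
            }
        }
        return q;
    }

    public static void main(String[] args) {
        String[] queries = { "date:20060801 test", "date: 20060801 test", "url:http date:20060801" };
        for (String s : queries) {
            FieldQuerySketch q = parse(s);
            System.out.println("[" + s + "] fields=" + q.fieldClauses + " terms=" + q.defaultTerms);
        }
    }
}
```

In this model, [date: 20060801 test] produces no field clause at all, because the value is detached from the field name and ends up among the ordinary terms, while [date:20060801 test] produces one date clause plus the term "test".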

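
The two plugin.includes values quoted above can also be compared directly. The sketch below is not Nutch code: PluginIncludesCheck and isIncluded are hypothetical names, and it assumes a plugin is enabled only when its whole directory name matches the expression, which is how I read the property description ("Any plugin not matching this expression is excluded").

```java
import java.util.regex.Pattern;

// Illustration only; this class is not part of Nutch.
class PluginIncludesCheck {

    // plugin.includes value from the first message in the thread.
    static final String ORIGINAL =
        "protocol-httpclient|urlfilter-regex"
        + "|parse-(text|html|js|oo|pdf|msword|mspowerpoint|rtf|zip)"
        + "|index-(basic|more)|query-(more|site|stemmer|url)"
        + "|summary-basic|scoring-opic";

    // plugin.includes value posted by Lukas.
    static final String SUGGESTED =
        "nutch-extensionpoints|protocol-http|urlfilter-regex"
        + "|parse-(text|rtf|html|js|msword|mspowerpoint|msexcel|pdf|zip|rss)"
        + "|index-(basic|more)|query-(basic|site|url|more)"
        + "|summary-basic|scoring-opic";

    // Assumption: the expression must match the entire plugin directory name.
    static boolean isIncluded(String includes, String pluginDir) {
        return Pattern.compile(includes).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] dirs = { "nutch-extensionpoints", "query-basic", "query-more", "index-more" };
        for (String dir : dirs) {
            System.out.println(dir
                + " original=" + isIncluded(ORIGINAL, dir)
                + " suggested=" + isIncluded(SUGGESTED, dir));
        }
    }
}
```

Under that assumption, the original value matches neither query-basic nor nutch-extensionpoints, while the value Lukas posted matches both. Whether that alone explains the empty results is not something this check can show.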
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
