Re: Limiting searching on fields

ahammad Mon, 26 Jan 2009 08:24:34 -0800

Yeah, Nutch is probably rewriting the query somehow, because Luke returns
very different results. For instance, if I put this:


authors:james authors:kent

In Luke, I get all the articles written by james and all the articles
written by kent, including the articles written by both. In the Nutch web
interface, I only get results that contain both james and kent as authors.
In other words, Luke treats the query as james OR kent, while Nutch treats
it as james AND kent. As far as I know, Nutch doesn't support OR queries.
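
To make the difference concrete, here is a minimal sketch in plain Python (not Nutch or Lucene code) of the two behaviours described above, run against a toy index of the "authors" field. The document ids and the helper names build_index, search_or and search_and are made up for illustration. It assumes Luke treats each clause as optional and Nutch treats each clause as required, which is what the results above suggest.

```python
from collections import defaultdict

# Toy "authors" field: document id -> already-tokenized author terms.
# These documents are hypothetical, purely for illustration.
DOCS = {
    1: ["james"],
    2: ["kent"],
    3: ["james", "kent"],
    4: ["smith"],
}


def build_index(docs):
    """Build a term -> set of document ids mapping (an inverted index)."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search_or(index, terms):
    """Every clause is optional: a document matches if it has ANY term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def search_and(index, terms):
    """Every clause is required: a document matches only if it has ALL terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_index(DOCS)
print(sorted(search_or(index, ["james", "kent"])))   # [1, 2, 3]
print(sorted(search_and(index, ["james", "kent"])))  # [3]
```

The OR result is a superset of the AND result, which matches what shows up when comparing the two tools: Luke's hits include everything the Nutch web interface returns, plus the single-author documents.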

Speaking of OR queries, any idea whether they'll be supported in Nutch's
next release?

Cheers


Doğacan Güney-3 wrote:
> 
> On Mon, Jan 26, 2009 at 4:00 PM, ahammad <[email protected]> wrote:
>>
>> Hello,
>>
>> Anyone have any ideas regarding this? I've been trying a few things over
>> the weekend. Everything I tried seemed to have broken Nutch.
>>
>> Does anybody have any ideas on how to do this, or has anyone done this in
>> the past? Any help would be much appreciated.
>>
> 
> Nutch really is not supposed to work like that. Nutch uses Lucene to
> search, and Lucene only matches full words, not individual letters. Try
> printing Lucene's boolean query right before it hits the index.
> 
> The really strange part is that Luke returns what you expect. So all I
> can think is that the resulting boolean query somehow transforms your
> terms.
> 
>> Cheers
>>
>>
>>
>> ahammad wrote:
>>>
>>> Hello,
>>>
>>> In the index, I have quite a few fields that are extracted from HTML meta
>>> tags. As an example, I have a field called "authors" which contains the
>>> name of the author of the document. On any given HTML page that I crawl,
>>> we can have something like the following:
>>>
>>> <meta name="authors" content="; jsmith ; jdoe ;" />
>>>
>>> I can now do authors:jsmith in Nutch's web interface and it would return
>>> that document with no issues. Here is where this can be a problem. Say we
>>> have two pages:
>>>
>>> <meta name="authors" content="; jsmith ;" /> for the first page
>>> <meta name="authors" content="; smith ;" /> for the second page
>>>
>>> If I do authors:smith, I will get both of these documents. What I want to
>>> be able to do is retrieve only the document with smith in the authors
>>> tag, not any other word with "smith" somewhere in it (e.g. jsmith,
>>> ksmith, jsmithers). Currently in the index, the tags are stored without
>>> the semicolons in the field, but that shouldn't be a big change.
>>>
>>> Would it be possible to make Nutch do that? Can you make it retrieve
>>> information as-is rather than match the word anywhere? If so, is it a
>>> global change or is it something that we can enable selectively as an
>>> option? Can we use regular expressions? I've attempted to explore some
>>> things but I didn't get very far, probably because I don't have a very
>>> thorough knowledge of the inner workings of Nutch.
>>>
>>> Note: searching the index using Luke gives the correct behaviour. I
>>> guess authors:smith does not go through any filters or anything like
>>> that, and hits the index directly. I want the same behaviour in Nutch.
>>>
>>> Thanks, I hope you guys can help me out with this.
>>>
>>> Cheers
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21665869.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Doğacan Güney
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21668416.html
Sent from the Nutch - User mailing list archive at Nabble.com.
