Hello,

In the index, I have quite a few fields that are extracted from HTML meta
tags. As an example, I have a field called "authors" which contains the names
of the authors of the document. Any given HTML page that I crawl can
have something like the following:

<meta name="authors" content="; jsmith ; jdoe ;" />

I can now do authors:jsmith in Nutch's web interface and it returns
that document with no issues. Here is where this becomes a problem. Say we
have two pages:

<meta name="authors" content="; jsmith ;" /> for the first page
<meta name="authors" content="; smith ;" /> for the second page

If I do authors:smith, I get both of these documents. What I want is to
retrieve only the document whose authors tag contains exactly "smith",
not any other word with "smith" somewhere in it (i.e. jsmith, ksmith,
jsmithers). Currently the values are stored in the index field without the
semicolons, but changing that shouldn't be a big job.
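
To make the behaviour I'm after concrete, here is a minimal sketch in plain
Java. It is not Nutch or Lucene code; the class and method names
(AuthorFieldMatch, parseAuthors, hasAuthor) are made up for illustration, and
it assumes the raw meta content, semicolons included, is available:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Nutch or Lucene).
class AuthorFieldMatch {

    // Split a raw meta value such as "; jsmith ; jdoe ;" into whole author tokens.
    static List<String> parseAuthors(String content) {
        List<String> authors = new ArrayList<>();
        for (String part : content.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                authors.add(name);
            }
        }
        return authors;
    }

    // True only when the query equals one complete author token
    // (case-sensitive); a substring such as "smith" in "jsmith" does not count.
    static boolean hasAuthor(String content, String query) {
        return parseAuthors(content).contains(query.trim());
    }

    public static void main(String[] args) {
        System.out.println(hasAuthor("; jsmith ;", "smith"));        // false
        System.out.println(hasAuthor("; smith ;", "smith"));         // true
        System.out.println(hasAuthor("; jsmith ; jdoe ;", "jdoe"));  // true
    }
}
```

In other words, authors:smith should behave like hasAuthor above: the first
page is rejected and the second is returned.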

Would it be possible to make Nutch do that? Can it be made to match the
field value as-is rather than match the word anywhere? If so, is it a global
change, or is it an option that can be enabled selectively? Can we use
regular expressions? I've attempted to explore a few things but didn't get
very far, probably because I don't have a very thorough knowledge of the
inner workings of Nutch.

Note: searching the index using Luke gives the correct behaviour. I guess
authors:smith there does not go through any filters or anything like that,
and hits the index directly. I want the same behaviour in Nutch.

Thanks, I hope you guys can help me out with this.

Cheers
-- 
View this message in context: 
http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21625294.html
Sent from the Nutch - User mailing list archive at Nabble.com.