Hello,

In the index, I have quite a few fields that are extracted from HTML meta tags. As an example, I have a field called "authors" which contains the names of the authors of the document. Any given HTML page that I crawl can contain something like the following:
<meta name="authors" content="; jsmith ; jdoe ;" />

I can now search for authors:jsmith in Nutch's web interface and it returns that document with no issues.

Here is where this can be a problem. Say we have two pages:

<meta name="authors" content="; jsmith ;" /> for the first page
<meta name="authors" content="; smith ;" /> for the second page

If I search for authors:smith, I get both of these documents. What I want is to retrieve only the document whose authors tag contains exactly smith, not any other word with "smith" somewhere in it (e.g. jsmith, ksmith, jsmithers). Currently the tags are stored in the index field without the semicolons, but that would not be hard to change.

Would it be possible to make Nutch do that? Can it match the field value as-is rather than matching the word anywhere? If so, is it a global change, or is it an option that we can enable selectively? Can we use regular expressions?

I've tried exploring a few things but didn't get very far, probably because I don't have a thorough knowledge of the inner workings of Nutch.

Note: searching the index using Luke gives the correct behaviour. I guess authors:smith there does not go through any filters and hits the index directly. I want the same behaviour in Nutch.

Thanks, I hope you guys can help me out with this.

Cheers

--
View this message in context: http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21625294.html
Sent from the Nutch - User mailing list archive at Nabble.com.
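
To make the desired behaviour concrete, here is a minimal sketch in plain Python. It is not Nutch or Lucene code and says nothing about how Nutch matches internally; the substring check is only a model of the results described above. The function names (`parse_authors`, `substring_match`, `exact_match`) and the page labels are made up for illustration.

```python
def parse_authors(content):
    """Split a meta 'content' value such as '; jsmith ; jdoe ;' into names."""
    return [name.strip() for name in content.split(";") if name.strip()]


def substring_match(content, query):
    """Model of the unwanted behaviour: the query matches anywhere in the value."""
    return query in content


def exact_match(content, query):
    """Desired behaviour: the query must equal one whole author name."""
    return query in parse_authors(content)


# The two example pages from the message (labels are hypothetical).
pages = {
    "page1": "; jsmith ;",
    "page2": "; smith ;",
}

loose = [page for page, content in pages.items() if substring_match(content, "smith")]
strict = [page for page, content in pages.items() if exact_match(content, "smith")]

print(loose)   # ['page1', 'page2']
print(strict)  # ['page2']
```

The second result is what authors:smith should return: only the page whose authors value contains smith as a complete name.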
