On Mon, Jan 26, 2009 at 4:00 PM, ahammad <[email protected]> wrote:
>
> Hello,
>
> Anyone have any ideas regarding this? I've been trying a few things over the
> weekend. Everything I tried seemed to have broken Nutch.
>
> Does anybody have any ideas on how to do this, or has anyone done this in
> the past? Any help would be much appreciated.
>
Nutch is not really supposed to work like that. Nutch uses Lucene to search,
and Lucene matches only whole terms, not individual letters, so authors:smith
should not match jsmith. Try printing Lucene's boolean query right before it
hits the index. The really strange part is that Luke returns what you expect,
so all I can think of is that the resulting boolean query somehow transforms
your terms.

> Cheers
>
>
>
> ahammad wrote:
>>
>> Hello,
>>
>> In the index, I have quite a few fields that are extracted from html meta
>> tags. As an example, I have a field called "authors" which contains the
>> name of the author of the document. On any given HTML page that I crawl,
>> we can have something like the following:
>>
>> <meta name="authors" content="; jsmith ; jdoe ;" />
>>
>> I can now do authors:jsmith in Nutch's web interface and it would return
>> that document with no issues. Here is where this can be a problem. Say we
>> have two pages:
>>
>> <meta name="authors" content="; jsmith ;" /> for the first page
>> <meta name="authors" content="; smith ;" /> for the second page
>>
>> If I do authors:smith, I will get both of these documents. What I want to
>> be able to do is to retrieve only the document with smith in the authors
>> tag, not any other word with "smith" somewhere in it (ie jsmith, ksmith,
>> jsmithers). Currently in the index, the tags are stored without the
>> semicolons in the field, but that shouldn't be a big change.
>>
>> Would it be possible to make Nutch do that? Can you make it retrieve
>> information as-is rather than match the word anywhere? If so, is it a
>> global change or is it something that can be like an option that we can
>> selectively choose? Can we use regular expressions? I've attempted to
>> explore some things but I didn't go too far with it, probably because I
>> don't have a very thorough knowledge with the inner workings of Nutch.
>>
>> Note- searching the index using Luke returns the correct behaviour. I
>> guess authors:smith does not go through any filters or anything like that,
>> and hits the index directly. I want the same behaviour in Nutch.
>>
>> Thanks, I hope you guys can help me out with this.
>>
>> Cheers
>>
>
> --
> View this message in context:
> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21665869.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

--
Doğacan Güney
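
For illustration only, the difference between whole-term matching and substring matching discussed in this thread can be sketched in a few lines of Python. This is not Nutch or Lucene code: the tokenizer, the function names and the two sample pages are assumptions made up for the example, and a real Lucene analyzer may split or normalise the field differently.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Split a field value into lowercase terms on any run of non-alphanumeric characters.

    Hypothetical stand-in for an analyzer; not Lucene's actual tokenization.
    """
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", value) if t]


def build_index(docs):
    """Map each whole term to the set of document ids whose field contains it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in tokenize(value):
            index[term].add(doc_id)
    return index


def term_search(index, term):
    """Whole-term lookup: the query must equal an indexed term exactly."""
    return sorted(index.get(term.lower(), set()))


def substring_search(docs, text):
    """Substring scan over the raw field value, shown only for contrast."""
    return sorted(d for d, v in docs.items() if text.lower() in v.lower())


# The two "authors" values from the question, keyed by made-up page ids.
docs = {
    "page1": "; jsmith ;",
    "page2": "; smith ;",
}
index = build_index(docs)

print(term_search(index, "smith"))      # ['page2']
print(substring_search(docs, "smith"))  # ['page1', 'page2']
```

With whole-term lookup, authors:smith finds only the second page, which is the behaviour reported from Luke. Getting both pages back looks like the substring case, which is why it seems worth inspecting the query that actually reaches the index.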
