Yeah, Nutch probably does something, because Luke's results are very different. For instance, if I put this:

authors:james authors:kent

In Luke, I get all the articles written by james and all the articles written by kent, as well as the articles written by both. In the Nutch web interface, I only get results that have both james and kent as authors. Effectively, Luke treats the query as james OR kent (which also covers james AND kent), while Nutch treats it as james AND kent. As far as I know, Nutch doesn't support OR queries.

Speaking of OR queries, any idea whether they'll be supported in Nutch's next release?

Cheers

Doğacan Güney-3 wrote:
>
> On Mon, Jan 26, 2009 at 4:00 PM, ahammad <[email protected]> wrote:
>>
>> Hello,
>>
>> Anyone have any ideas regarding this? I've been trying a few things over
>> the weekend. Everything I tried seemed to have broken Nutch.
>>
>> Does anybody have any ideas on how to do this, or has anyone done this in
>> the past? Any help would be much appreciated.
>>
>
> Nutch really is not supposed to work like that. Nutch uses lucene to
> search and lucene only matches full words and not individual letters.
> Try printing lucene's boolean query right before it hits the index.
>
> Really strange part is that luke returns what you expect. So all I can
> think is that somehow resulting boolean query transforms your terms.
>
>> Cheers
>>
>>
>>
>> ahammad wrote:
>>>
>>> Hello,
>>>
>>> In the index, I have quite a few fields that are extracted from html
>>> meta tags. As an example, I have a field called "authors" which
>>> contains the name of the author of the document. On any given HTML
>>> page that I crawl, we can have something like the following:
>>>
>>> <meta name="authors" content="; jsmith ; jdoe ;" />
>>>
>>> I can now do authors:jsmith in Nutch's web interface and it would
>>> return that document with no issues. Here is where this can be a
>>> problem. Say we have two pages:
>>>
>>> <meta name="authors" content="; jsmith ;" /> for the first page
>>> <meta name="authors" content="; smith ;" /> for the second page
>>>
>>> If I do authors:smith, I will get both of these documents. What I want
>>> to be able to do is to retrieve only the document with smith in the
>>> authors tag, not any other word with "smith" somewhere in it (ie
>>> jsmith, ksmith, jsmithers). Currently in the index, the tags are
>>> stored without the semicolons in the field, but that shouldn't be a
>>> big change.
>>>
>>> Would it be possible to make Nutch do that? Can you make it retrieve
>>> information as-is rather than match the word anywhere? If so, is it a
>>> global change or is it something that can be like an option that we
>>> can selectively choose? Can we use regular expressions? I've attempted
>>> to explore some things but I didn't go too far with it, probably
>>> because I don't have a very thorough knowledge with the inner workings
>>> of Nutch.
>>>
>>> Note- searching the index using Luke returns the correct behaviour. I
>>> guess authors:smith does not go through any filters or anything like
>>> that, and hits the index directly. I want the same behaviour in Nutch.
>>>
>>> Thanks, I hope you guys can help me out with this.
>>>
>>> Cheers
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21665869.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Doğacan Güney
>
>

--
View this message in context: http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21668416.html
Sent from the Nutch - User mailing list archive at Nabble.com.
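
For anyone reading the thread later, the difference described at the top of this message (Luke returning documents that match either author, Nutch returning only documents that match both) can be illustrated with a toy inverted index. This is only a rough sketch in plain Python. It does not use the Lucene or Nutch APIs, and the document IDs, the `tokenize` rule, and the `search_any` / `search_all` helpers are all made up for illustration. It also shows the whole-token matching mentioned in the quoted reply, where `smith` does not match `jsmith`.

```python
# Toy model only: this is NOT the Lucene or Nutch API.
# Hypothetical documents, keyed by a made-up ID, holding an "authors" field.
docs = {
    "doc1": "; james ;",
    "doc2": "; kent ;",
    "doc3": "; james ; kent ;",
    "doc4": "; jsmith ;",
    "doc5": "; smith ;",
}


def tokenize(field_value):
    """Split on semicolons and whitespace, keeping only whole tokens."""
    return set(field_value.replace(";", " ").split())


def build_index(documents):
    """Map each whole token to the set of document IDs containing it."""
    index = {}
    for doc_id, value in documents.items():
        for token in tokenize(value):
            index.setdefault(token, set()).add(doc_id)
    return index


def search_any(index, terms):
    """OR semantics: a document matches if it contains at least one term."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def search_all(index, terms):
    """AND semantics: a document matches only if it contains every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_index(docs)
or_hits = sorted(search_any(index, ["james", "kent"]))
and_hits = sorted(search_all(index, ["james", "kent"]))
smith_hits = sorted(search_any(index, ["smith"]))

print(or_hits)     # ['doc1', 'doc2', 'doc3']
print(and_hits)    # ['doc3']
print(smith_hits)  # ['doc5']
```

In this toy model, `search_any` corresponds to what I see in Luke and `search_all` to what I see in the Nutch web interface. How Nutch actually builds its boolean query internally is not something this sketch claims to reproduce.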
