On Mon, Jan 26, 2009 at 9:54 PM, ahammad <[email protected]> wrote:
>
> Yea, Nutch probably does something, because Luke results are very
> different. For instance, if I put this:
>
> authors:james authors:kent

Just print the Lucene query in Nutch and copy/paste it into Luke. Then,
below the search-text area in Luke, use the Update and Explain tab to
understand the query structure.

> In Luke, I get all the articles written by james and all the articles
> written by kent, as well as the articles written by both. In the Nutch web
> interface, I get results that contain both james and kent as authors.
> Effectively, in Luke, we get james AND kent as well as james OR kent. As far
> as I know, Nutch doesn't support OR queries.

It does support OR queries. You should look at the query structure of
Lucene (there are things called SHOULD and MUST terms in the query structure).

> Speaking of the OR queries, any idea on whether or not it'll be supported in
> Nutch's next release?
>
> Cheers
>
>
> Doğacan Güney-3 wrote:
> >
> > On Mon, Jan 26, 2009 at 4:00 PM, ahammad <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> Anyone have any ideas regarding this? I've been trying a few things over
> >> the weekend. Everything I tried seemed to have broken Nutch.
> >>
> >> Does anybody have any ideas on how to do this, or has anyone done this in
> >> the past? Any help would be much appreciated.
> >>
> >
> > Nutch really is not supposed to work like that. Nutch uses lucene to
> > search and lucene only matches full words and not individual letters.
> > Try printing lucene's boolean query right before it hits the index.
> >
> > Really strange part is that luke returns what you expect. So all I can
> > think is that somehow resulting boolean query transforms your terms.
> >
> >> Cheers
> >>
> >>
> >> ahammad wrote:
> >>>
> >>> Hello,
> >>>
> >>> In the index, I have quite a few fields that are extracted from html
> >>> meta tags.
> >>> As an example, I have a field called "authors" which contains the
> >>> name of the author of the document. On any given HTML page that I crawl,
> >>> we can have something like the following:
> >>>
> >>> <meta name="authors" content="; jsmith ; jdoe ;" />
> >>>
> >>> I can now do authors:jsmith in Nutch's web interface and it would return
> >>> that document with no issues. Here is where this can be a problem. Say
> >>> we have two pages:
> >>>
> >>> <meta name="authors" content="; jsmith ;" /> for the first page
> >>> <meta name="authors" content="; smith ;" /> for the second page
> >>>
> >>> If I do authors:smith, I will get both of these documents. What I want
> >>> to be able to do is to retrieve only the document with smith in the
> >>> authors tag, not any other word with "smith" somewhere in it (ie jsmith,
> >>> ksmith, jsmithers). Currently in the index, the tags are stored without
> >>> the semicolons in the field, but that shouldn't be a big change.
> >>>
> >>> Would it be possible to make Nutch do that? Can you make it retrieve
> >>> information as-is rather than match the word anywhere? If so, is it a
> >>> global change or is it something that can be like an option that we can
> >>> selectively choose? Can we use regular expressions? I've attempted to
> >>> explore some things but I didn't go too far with it, probably because I
> >>> don't have a very thorough knowledge with the inner workings of Nutch.
> >>>
> >>> Note- searching the index using Luke returns the correct behaviour. I
> >>> guess authors:smith does not go through any filters or anything like
> >>> that, and hits the index directly. I want the same behaviour in Nutch.
> >>>
> >>> Thanks, I hope you guys can help me out with this.
> >>>
> >>> Cheers
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21665869.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> > --
> > Doğacan Güney
> >
>
> --
> View this message in context:
> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21668416.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv
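
To illustrate the SHOULD versus MUST distinction mentioned above, here is a
minimal sketch in Python. It is a toy in-memory model and not Lucene's or
Nutch's actual API: the index contents, the document ids, and the search()
function are all made up for illustration. It assumes the usual reading that
a query made only of SHOULD terms behaves like OR, while MUST terms behave
like AND.

```python
# Toy inverted index for the "authors" field: term -> set of document ids.
# The terms and ids are hypothetical, chosen to mirror the thread's examples.
AUTHORS_INDEX = {
    "james": {1, 3},
    "kent": {2, 3},
    "jsmith": {4},
    "smith": {5},
}


def search(should=(), must=()):
    """Return the ids of documents matching the given clauses.

    must:   every term has to be present (AND-like).
    should: if there are no must terms, at least one has to be present
            (OR-like); if there are must terms, should terms are optional
            and do not change which documents match.
    """
    must_sets = [AUTHORS_INDEX.get(term, set()) for term in must]
    should_sets = [AUTHORS_INDEX.get(term, set()) for term in should]
    if must_sets:
        return set.intersection(*must_sets)
    if should_sets:
        return set.union(*should_sets)
    return set()


# Both terms as SHOULD: articles by james, by kent, or by both.
print(sorted(search(should=("james", "kent"))))
# Both terms as MUST: only articles with both authors.
print(sorted(search(must=("james", "kent"))))
# Lookup is by whole term, so "smith" does not pick up "jsmith".
print(sorted(search(must=("smith",))))
```

In this model the first call corresponds to what was described for Luke and
the second to what was described for the Nutch web interface, which is why
comparing the printed query structure in both tools is a useful first step.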
