On Mon, Jan 26, 2009 at 9:54 PM, ahammad <[email protected]> wrote:

>
> Yea, Nutch probably does something, because Luke results are very
> different.
> For instance, if I put this:
>
> authors:james authors:kent
>

Just print the Lucene query in Nutch and copy/paste it into Luke. Then, below
the search text area in Luke, use Update and the Explain tab to understand the
query structure.



>
> In Luke, I get all the articles written by james and all the articles
> written by kent, as well as the articles written by both. In the Nutch web
> interface, I get results that contain both james and kent as authors.
> Effectively, in Luke, we get james AND kent as well as james OR kent. As far
> as I know, Nutch doesn't support OR queries.
>


It does support OR queries. You should look at the structure of the Lucene
query (each clause in it is marked as either a SHOULD or a MUST term).
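A rough sketch of the SHOULD/MUST difference, written in plain Java rather than against the real Lucene classes. ClauseDemo, Clause, sampleIndex and the three sample documents are made up for illustration; only the MUST/SHOULD idea comes from Lucene's BooleanQuery. I am assuming here that Luke's parser turns the two terms into SHOULD clauses and that Nutch marks them as required, which would explain the results you describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ClauseDemo {
    // Simplified model of a boolean clause: a term plus how it must occur.
    enum Occur { MUST, SHOULD }

    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    // A document matches if every MUST term is present. If the query has
    // no MUST clause at all, at least one SHOULD term has to be present.
    static boolean matches(Set<String> authors, List<Clause> query) {
        boolean anyMust = false;
        boolean anyShouldHit = false;
        for (Clause c : query) {
            boolean hit = authors.contains(c.term);
            if (c.occur == Occur.MUST) {
                anyMust = true;
                if (!hit) return false;
            } else if (hit) {
                anyShouldHit = true;
            }
        }
        return anyMust || anyShouldHit;
    }

    // Returns the ids of all documents whose authors satisfy the query.
    static List<String> search(Map<String, Set<String>> index, List<Clause> query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            if (matches(e.getValue(), query)) hits.add(e.getKey());
        }
        return hits;
    }

    // Builds a query in which every term gets the same Occur flag.
    static List<Clause> query(Occur occur, String... terms) {
        List<Clause> clauses = new ArrayList<>();
        for (String t : terms) clauses.add(new Clause(t, occur));
        return clauses;
    }

    // Hypothetical index: document id -> terms in its "authors" field.
    static Map<String, Set<String>> sampleIndex() {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        index.put("doc1", new HashSet<>(Arrays.asList("james")));
        index.put("doc2", new HashSet<>(Arrays.asList("kent")));
        index.put("doc3", new HashSet<>(Arrays.asList("james", "kent")));
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = sampleIndex();
        // SHOULD on both terms behaves like "james OR kent".
        System.out.println("SHOULD: " + search(index, query(Occur.SHOULD, "james", "kent")));
        // MUST on both terms behaves like "james AND kent".
        System.out.println("MUST:   " + search(index, query(Occur.MUST, "james", "kent")));
    }
}
```

With two SHOULD clauses all three sample documents come back; with two MUST clauses only the document that has both authors does. Comparing the printed Nutch query against this pattern in Luke should show which flags your terms actually get.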


> Speaking of the OR queries, any idea on whether or not it'll be supported in
> Nutch's next release?
>
>


> Cheers
>
>
> Doğacan Güney-3 wrote:
> >
> > On Mon, Jan 26, 2009 at 4:00 PM, ahammad <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> Anyone have any ideas regarding this? I've been trying a few things over
> >> the weekend. Everything I tried seemed to have broken Nutch.
> >>
> >> Does anybody have any ideas on how to do this, or has anyone done this in
> >> the past? Any help would be much appreciated.
> >>
> >
> > Nutch really is not supposed to work like that. Nutch uses Lucene to search,
> > and Lucene only matches full words, not individual letters. Try printing
> > Lucene's boolean query right before it hits the index.
> >
> > The really strange part is that Luke returns what you expect. So all I can
> > think is that somehow the resulting boolean query transforms your terms.
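To illustrate the whole-word matching described above, here is a small sketch in plain Java. The tokenize method is a simplified stand-in that I made up (lowercase, split on anything that is not a letter or digit); it is not the analyzer Nutch or Lucene actually uses, and TokenMatchDemo and termMatches are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class TokenMatchDemo {
    // Hypothetical analyzer stand-in: lowercases the field value and
    // splits it on every run of characters that are not letters or digits.
    static Set<String> tokenize(String fieldValue) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : fieldValue.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // A term query matches only if the whole term equals one of the tokens;
    // a substring of a longer token is not a match.
    static boolean termMatches(String fieldValue, String term) {
        return tokenize(fieldValue).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(termMatches("; jsmith ;", "smith")); // false
        System.out.println(termMatches("; smith ;", "smith"));  // true
    }
}
```

If the index behaves like this model, authors:smith cannot hit the jsmith page on its own, so something between the search box and the index must be rewriting the term.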
> >
> >> Cheers
> >>
> >>
> >>
> >> ahammad wrote:
> >>>
> >>> Hello,
> >>>
> >>> In the index, I have quite a few fields that are extracted from html meta
> >>> tags. As an example, I have a field called "authors" which contains the
> >>> name of the author of the document. On any given HTML page that I crawl,
> >>> we can have something like the following:
> >>>
> >>> <meta name="authors" content="; jsmith ; jdoe ;" />
> >>>
> >>> I can now do authors:jsmith in Nutch's web interface and it would return
> >>> that document with no issues. Here is where this can be a problem. Say we
> >>> have two pages:
> >>>
> >>> <meta name="authors" content="; jsmith ;" /> for the first page
> >>> <meta name="authors" content="; smith ;" /> for the second page
> >>>
> >>> If I do authors:smith, I will get both of these documents. What I want to
> >>> be able to do is to retrieve only the document with smith in the authors
> >>> tag, not any other word with "smith" somewhere in it (ie jsmith, ksmith,
> >>> jsmithers). Currently in the index, the tags are stored without the
> >>> semicolons in the field, but that shouldn't be a big change.
> >>>
> >>> Would it be possible to make Nutch do that? Can you make it retrieve
> >>> information as-is rather than match the word anywhere? If so, is it a
> >>> global change or is it something that can be like an option that we can
> >>> selectively choose? Can we use regular expressions? I've attempted to
> >>> explore some things but I didn't go too far with it, probably because I
> >>> don't have a very thorough knowledge with the inner workings of Nutch.
> >>>
> >>> Note- searching the index using Luke returns the correct behaviour. I
> >>> guess authors:smith does not go through any filters or anything like that,
> >>> and hits the index directly. I want the same behaviour in Nutch.
> >>>
> >>> Thanks, I hope you guys can help me out with this.
> >>>
> >>> Cheers
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21665869.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> >
> > --
> > Doğacan Güney
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Limiting-searching-on-fields-tp21625294p21668416.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv
