[Nutch-general] Re: Common Terms

Rajesh Munavalli Thu, 30 Mar 2006 15:04:03 -0800

I guess you would need them for phrase queries. If you do not index them, you
would never be able to retrieve a phrase like "the americas".
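
A minimal sketch of the point above, using a toy positional index in Python.
This is not Nutch or Lucene code: STOP_WORDS, build_index and phrase_search are
made-up names, and the two documents are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stop list for illustration; not the list from NutchAnalysis.
STOP_WORDS = {"the", "a", "of"}


def build_index(docs, drop_stop_words):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if drop_stop_words and term in STOP_WORDS:
                continue  # the term is never written to the index
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids where the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []  # a term that was never indexed can never match
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + i in index[t].get(doc_id, [])
                   for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "the americas are two continents",
    2: "americas cup sailing",
}

full = build_index(docs, drop_stop_words=False)
stripped = build_index(docs, drop_stop_words=True)

print(phrase_search(full, "the americas"))      # [1]
print(phrase_search(stripped, "the americas"))  # []
print(phrase_search(stripped, "americas"))      # [1, 2]
```

With "the" stripped at indexing time, the phrase cannot be matched at all, and
falling back to the bare term "americas" also returns the unrelated document.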


--Rajesh Munavalli
 Blog: http://mathsearch.blogspot.com


On 3/30/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
>
>        Thanks.  That gives me the full list.  The odd thing to me is
> that none of those words will end up being effective in a search, so why
> not strip them all out during indexing?
>
> Thanks again,
> Jake.
>
> -----Original Message-----
> From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 30, 2006 5:24 PM
> To: [email protected]
> Subject: Re: Common Terms
>
> There is a list of stop words in the NutchAnalysis class
> (org.apache.nutch.analysis). I guess that's where the common terms are
> removed during analysis.
>
> --Rajesh Munavalli
> Blog: http://mathsearch.blogspot.com
>
> Vanderdray, Jacob wrote:
> >       I've added some code to query-basic to log the query after it
> > has run both addTerms and addPhrases.  This helps me to better
> > understand what's going on.  I've noticed that when my search contains
> > words like "the" or "a", those don't appear in the actual query.
> >
> >       It looks to me like the common-terms.utf8 file is supposed to be
> > used to strip common words like "the" out of queries for specific
> > fields, but that doesn't seem to be what's happening.  The term "the"
> > ends up getting stripped out of the query for all fields (url, content,
> > anchor, etc.).  I even tried removing "the" from the common-terms.utf8
> > file, but didn't see any change in behavior.
> >
> >       Does this file only get used when indexing?  If so, what
> > determines which words get stripped out of searches?
> >
> > Thanks,
> > Jake.
> >
> >
>
>
