I guess you would need them for phrase queries. If you do not index them, you would never be able to retrieve something like "the americas" as an exact phrase.
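To make the phrase-query point concrete, here is a minimal sketch in Python. It is a toy positional index, not Nutch's or Lucene's actual code. The stop list, the sample documents, and the names `build_index` and `phrase_search` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stop list for illustration; NOT the actual NutchAnalysis list.
STOP_WORDS = {"the", "a", "of", "in"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def build_index(docs, drop_stop_words):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            if drop_stop_words and term in STOP_WORDS:
                continue  # the term is never indexed
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids where the phrase terms occur consecutively."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []  # a term that was never indexed can never match
    hits = []
    for doc_id, positions in index[terms[0]].items():
        for start in positions:
            if all(
                doc_id in index[t] and start + i in index[t][doc_id]
                for i, t in enumerate(terms)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


# Made-up sample documents.
docs = {1: "History of the Americas", 2: "Americas cup sailing"}

full = build_index(docs, drop_stop_words=False)
stripped = build_index(docs, drop_stop_words=True)

print(phrase_search(full, "the americas"))      # phrase found only in doc 1
print(phrase_search(stripped, "the americas"))  # "the" was never indexed
print(phrase_search(stripped, "americas"))      # fallback matches both docs
```

With the stop words indexed, the phrase matches only document 1. With them stripped, the phrase cannot match at all, and falling back to the bare term "americas" also pulls in document 2, so the query loses precision.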
--Rajesh Munavalli
Blog: http://mathsearch.blogspot.com

On 3/30/06, Vanderdray, Jacob <[EMAIL PROTECTED]> wrote:
>
> Thanks. That gives me the full list. The odd thing to me is
> that none of those words will end up being effective in a search, so why
> not strip them all out during indexing?
>
> Thanks again,
> Jake.
>
> -----Original Message-----
> From: Rajesh Munavalli [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 30, 2006 5:24 PM
> To: [email protected]
> Subject: Re: Common Terms
>
> There is a list of stop words in NutchAnalysis class
> (org.apache.nutch.analysis). I guess thats where the common terms are
> removed during analysis.
>
> --Rajesh Munavalli
> Blog: http://mathsearch.blogspot.com
>
> Vanderdray, Jacob wrote:
> > I've added some code to query-basic to log the query after it
> > has run both addTerms and addPhrases. This helps me to better
> > understand what's going on. I've noticed that when my search contains
> > words like "the" or "a", those don't appear in the actual query.
> >
> > It looks to me like the common-terms.utf8 file is supposed to be
> > used to strip common words like "the" out of queries for specific
> > fields, but that doesn't seem to be what's happening. The term "the"
> > ends up getting stripped out of the query for all fields (url, content,
> > anchor, etc.). I even tried removing "the" from the common-terms.utf8
> > file, but didn't see any change in behavior.
> >
> > Does this file only get used when indexing? If so what
> > determines which words get stripped out of searches?
> >
> > Thanks,
> > Jake.
