Re: [Nutch-general] common-terms.utf8

Andrzej Bialecki Sat, 12 Aug 2006 03:41:28 -0700

[EMAIL PROTECTED] wrote:

This is because Nutch turns those common terms into ngrams (not sure of what 
size), and that increases the size of the index.
For example, if you have a phrase like:


  vacation time

Normally, Nutch will index this phrase as 2 terms, a total of 12 characters 
(probably less, if these words are stemmed).
If those two words are defined as common terms, and Nutch indexes them as 
ngrams (say bigrams), it will index something like this:

  va ac ca at ti io on ti im me

No, Nutch uses word-level ngrams (where n=2), so using this example it would be:


   vacation-time

Or, using a better example:

   "words in common"

becomes:

   words-in in-common
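
To make the transformation concrete, here is a minimal Python sketch of word-level bigrams. It is not Nutch's actual code: the function name common_grams and the rule "join two adjacent words with '-' when either one is a common term" are assumptions chosen so that the output matches the two examples above.

```python
def common_grams(text, common_terms):
    """Return word-level bigrams for adjacent pairs that involve a common term.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    words = text.lower().split()
    grams = []
    # Walk over each pair of neighbouring words.
    for left, right in zip(words, words[1:]):
        # Assumed rule: emit a bigram if either neighbour is a common term.
        if left in common_terms or right in common_terms:
            grams.append(f"{left}-{right}")
    return grams


print(common_grams("words in common", {"in"}))               # ['words-in', 'in-common']
print(common_grams("vacation time", {"vacation", "time"}))   # ['vacation-time']
```

Here the set passed as common_terms stands in for the contents of common-terms.utf8.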

This is especially useful for phrase queries, because it drastically reduces the number of postings to check: a bigram such as "in-common" occurs in far fewer documents than the common term "in" alone. This comes at the cost of a somewhat larger index.
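
The trade-off can be seen on a toy inverted index. The sketch below is an illustration under stated assumptions, not a measurement of Nutch: the four documents, the COMMON set and the build_postings helper are all made up for the example.

```python
from collections import defaultdict

# Made-up documents and common-term list, for illustration only.
DOCS = {
    1: "words in common",
    2: "in the house",
    3: "lost in space",
    4: "common words",
}
COMMON = {"in", "the"}


def build_postings(docs, with_grams):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for word in words:
            postings[word].add(doc_id)
        if with_grams:
            # Also index word-level bigrams that involve a common term.
            for left, right in zip(words, words[1:]):
                if left in COMMON or right in COMMON:
                    postings[f"{left}-{right}"].add(doc_id)
    return postings


plain = build_postings(DOCS, with_grams=False)
grams = build_postings(DOCS, with_grams=True)

print(len(plain["in"]))          # documents to check for the common term "in"
print(len(grams["in-common"]))   # documents to check for the bigram "in-common"
print(len(plain), len(grams))    # number of distinct terms without / with bigrams
```

In this toy index the bigram points at fewer documents than the common term does, while the total number of distinct terms goes up.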

You can clearly see the effects of this file if you run your query through the Nutch query parser:


   bin/nutch org.apache.nutch.searcher.Query

Give it a phrase query (surrounded by double quotes) containing one of the common terms, and see what happens.


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
