[EMAIL PROTECTED] wrote:
> This is because Nutch turns those common terms into ngrams (not sure of what
> size), and that increases the size of the index.
> For example, if you have a phrase like:
>
>   vacation time
>
> Normally, Nutch will index this phrase as 2 terms, a total of 12 characters
> (probably less, if these words are stemmed).
> If those two words are defined as common terms, and Nutch indexes them as
> ngrams (say bigrams), it will index something like this:
>
>   va ac ca at ti io on ti im me

No, Nutch uses word-level ngrams (where n=2), so using this example it would be:

   vacation-time

Or, using a better example:

   "words in common"

becomes:

   words-in in-common

This is especially useful for phrase queries: a bigram such as words-in occurs in far fewer documents than a common term such as "in" on its own, so the query has far fewer postings to check. The cost is a somewhat larger index.
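
To make the word-level bigrams concrete, here is a minimal sketch in Python. It is not the Nutch implementation: the function name `common_bigrams`, the `-` separator and the rule "join any adjacent pair that involves a common term" are assumptions chosen only to reproduce the examples above, and the real analyzer may also keep the individual terms.

```python
def common_bigrams(tokens, common_terms, sep="-"):
    """Join each adjacent pair of tokens in which at least one is a common term.

    Hypothetical helper for illustration only; it mimics the examples in
    this message, not the actual Nutch analysis code.
    """
    grams = []
    # Walk over adjacent word pairs: (t0, t1), (t1, t2), ...
    for left, right in zip(tokens, tokens[1:]):
        if left in common_terms or right in common_terms:
            grams.append(left + sep + right)
    return grams


# "in" is treated as the only common term here (an assumption).
print(common_bigrams(["words", "in", "common"], {"in"}))
# prints ['words-in', 'in-common']
```

With both "vacation" and "time" treated as common terms, the same function yields the single gram vacation-time.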

You can clearly see the effects of this file if you run your query through the Nutch query parser:

   bin/nutch org.apache.nutch.searcher.Query

Give it a phrase query (surrounded by double quotes) containing one of the common terms, and see what happens.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

