[EMAIL PROTECTED] wrote:
> This is because Nutch turns those common terms into ngrams (not sure of what
> size), and that increases the size of the index.
> For example, if you have a phrase like:
> vacation time
> Normally, Nutch will index this phrase as 2 terms, a total of 12 characters
> (probably less, if these words are stemmed)
> If those two words are defined as common terms, and Nutch indexes them as
> ngrams (say bigrams), it will index something like this:
> va ac ca at ti io on ti im me

No, Nutch uses word-level ngrams (where n=2), so using this example it
would be:
vacation-time
Or, using a better example:
"words in common"
becomes:
words-in in-common
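
To make the word-level pairing concrete, here is a minimal Python sketch.
It is not Nutch's actual code: the function name common_bigrams is
hypothetical, and the rule "join two adjacent words when at least one of
them is a common term" is an assumption chosen to reproduce the two
examples above.

```python
def common_bigrams(text, common_terms):
    """Join adjacent words with '-' when at least one of the pair is a common term.

    Hypothetical helper for illustration only; the pairing rule is an assumption.
    """
    words = text.lower().split()
    grams = []
    # Walk over every adjacent pair of words in the phrase.
    for left, right in zip(words, words[1:]):
        if left in common_terms or right in common_terms:
            grams.append(f"{left}-{right}")
    return grams


print(common_bigrams("words in common", {"in"}))
# ['words-in', 'in-common']
print(common_bigrams("vacation time", {"vacation", "time"}))
# ['vacation-time']
```

Note that the units being paired are whole words, not characters, which
is why the result is nothing like "va ac ca ...".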
This is especially useful for phrase queries, because it drastically
reduces the work needed to match the phrase: a combined term such as
in-common occurs in far fewer documents than a common word like "in"
on its own, so there are far fewer postings to check. This comes at
the cost of a somewhat larger index.
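
The trade-off can be seen on a toy inverted index. The sketch below is
only an illustration under stated assumptions: the four documents are
made up, build_index is a hypothetical helper, and the structure is a
plain dictionary of document-id sets, not the real Nutch/Lucene index
format.

```python
from collections import defaultdict


def build_index(docs, common_terms):
    """Map each term (plain word or word-level bigram) to the set of doc ids containing it.

    Hypothetical helper; bigrams are added when either word of a pair is a common term.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            index[word].add(doc_id)
        for left, right in zip(words, words[1:]):
            if left in common_terms or right in common_terms:
                index[f"{left}-{right}"].add(doc_id)
    return index


# Made-up documents, for illustration only.
docs = {
    1: "words in common",
    2: "time in the sun",
    3: "vacation in spain",
    4: "nothing in common",
}

plain = build_index(docs, set())
with_grams = build_index(docs, {"in", "the"})

print(len(with_grams["in"]))         # 4 postings for the common word
print(len(with_grams["in-common"]))  # 2 postings for the combined term
print(len(plain), len(with_grams))   # 9 17 unique terms: the index grows
```

The combined term has fewer postings to scan than the common word alone,
while the number of unique terms in the index goes up.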
You can clearly see the effect of this file if you run your query
through the Nutch query parser:
bin/nutch org.apache.nutch.searcher.Query
Give it a phrase query (surrounded by double quotes) containing one of
the common terms, and see what happens.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com