Hannu,
Do you use the same set of QueryFilters both in the webapp and when running
from the shell?
Perhaps your filter is not executed when running from the CLI? You can
verify how your query is transformed by running bin/nutch
org.apache.nutch.searcher.Query and entering some queries.
--
Sami Siren
Hannu Väisänen wrote:
I am using Nutch 1.0 to index files written in Finnish.
I have written a filter, MorphologyHVSuggestionFilter, that converts
Finnish words to their base forms (the forms you find in dictionaries).
I index only the base forms, so that a search for a base form finds
all of its inflected forms.
When I search for the word 'kuka' like this:
bin/nutch org.apache.nutch.searcher.NutchBean kuka
Total hits: 245
Tomcat6 also finds 245 hits.
But when I search for the word 'kuusi':
bin/nutch org.apache.nutch.searcher.NutchBean kuusi
Total hits: 212
Tomcat6 finds only 14 hits.
The Tomcat6 log shows this for the word 'kuka':
2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,909 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token1 (kuka,0,4)
2010-02-16 21:25:40,910 DEBUG MorphologyHVSuggestionFilter - Token2 (kuka,0,4)
2010-02-16 21:25:40,910 INFO NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO NutchBean - query: kuka
2010-02-16 21:25:40,910 INFO NutchBean - lang: fi
2010-02-16 21:25:40,910 INFO NutchBean - lang: fi
2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,911 INFO NutchBean - searching for 20 raw hits
2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site:""
2010-02-16 21:25:40,939 INFO NutchBean - re-searching for 40 raw hits, query: kuka -site:""
2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits
2010-02-16 21:25:40,941 INFO NutchBean - found 0 raw hits
2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245
2010-02-16 21:25:40,969 INFO NutchBean - total hits: 245
The Tomcat6 log shows this for the word 'kuusi':
2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,777 INFO NutchBean - query request from 0:0:0:0:0:0:0:1
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token1 (kuusi,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuu,0,5)
2010-02-16 21:23:12,778 DEBUG MorphologyHVSuggestionFilter - Token2 (kuusi,0,0,posIncr=0)
2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO NutchBean - query: kuusi
2010-02-16 21:23:12,778 INFO NutchBean - lang: fi
2010-02-16 21:23:12,778 INFO NutchBean - lang: fi
2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 INFO NutchBean - searching for 20 raw hits
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing "kuu kuusi" for url
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing "kuu kuusi" for anchor
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing "kuu kuusi" for content
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing "kuu kuusi" for title
2010-02-16 21:23:12,780 DEBUG CommonGrams - Optimizing "kuu kuusi" for host
2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14
2010-02-16 21:23:12,813 INFO NutchBean - total hits: 14
The difference between the words 'kuka' and 'kuusi' is that 'kuka'
has only one base form (which happens to be 'kuka'), whereas
'kuusi' has two base forms, 'kuusi' and 'kuu' ('moon'; 'si' is a
possessive suffix).
So is it possible that when I search through Tomcat6, Nutch returns
only those files that contain both words, 'kuusi' and 'kuu'? If so, how
can I change this so that it finds files that contain either 'kuusi' or
'kuu' (or, of course, any other base form of the word I search for :-)?
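
To make the either/or behaviour asked about above concrete, here is a minimal
sketch in plain Java. It is not Nutch or Lucene code: the class
BaseFormGrouping, its nested Tok type and both methods are hypothetical names
invented for illustration. It assumes only what the log shows, namely that the
filter emits the second base form with posIncr=0. Tokens at the same position
are grouped as alternatives (OR), and separate positions are combined with AND.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch or Lucene.
public class BaseFormGrouping {

    /** Stand-in for an analyzer token: its text and its position increment. */
    public static final class Tok {
        public final String text;
        public final int posIncr;

        public Tok(String text, int posIncr) {
            this.text = text;
            this.posIncr = posIncr;
        }
    }

    /** A token with posIncr == 0 joins the previous group; any other token starts a new one. */
    public static List<List<String>> groupByPosition(List<Tok> tokens) {
        List<List<String>> groups = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.posIncr == 0 && !groups.isEmpty()) {
                groups.get(groups.size() - 1).add(t.text);
            } else {
                List<String> group = new ArrayList<>();
                group.add(t.text);
                groups.add(group);
            }
        }
        return groups;
    }

    /** Renders the groups as text: OR inside a group, AND between groups. */
    public static String toBooleanExpression(List<List<String>> groups) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < groups.size(); i++) {
            if (i > 0) {
                sb.append(" AND ");
            }
            List<String> group = groups.get(i);
            if (group.size() == 1) {
                sb.append(group.get(0));
            } else {
                sb.append("(").append(String.join(" OR ", group)).append(")");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two tokens logged for 'kuusi': kuu, then kuusi with posIncr=0.
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("kuu", 1));
        tokens.add(new Tok("kuusi", 0));
        System.out.println(toBooleanExpression(groupByPosition(tokens)));
        // prints "(kuu OR kuusi)"
    }
}
```

The CommonGrams lines in the log show "kuu kuusi" being handled as one unit,
which would be consistent with both base forms being required together.
Whether and where Nutch 1.0 can be made to build the OR form instead is not
something this sketch settles.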