Hey, I need stemming in my search engine, which is based on Nutch 0.7.2. The stemmed query is being created, but I am not getting appropriate results: if I search for hotel, I get 11 results, but if I search for hotels, I get 1 result. Any thoughts?

I have implemented stemming using the code from the mail by Howie Wang on 2005-06-22. It can be found at http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html among many other places.

To help people, I plan to add a Nutch category to my blog and share my knowledge once I have a decent hang of how various things work in Nutch.

Thanks and Regards,
Jayant Gandhi
--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
M.Tech. Computer Tech. Class of 2007, IIT Delhi

060629 155824 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155824 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155824 10 opening merged index in C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155824 10 Plugins: looking in: C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.apache.nutch.parse.rating.RatingParser
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.parse.rating.RatingIndexer
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060629 155824 10 Added a rating query
o.getTerm = hotel token = hotel type = word stemmedValues = hotel
o.getTerm = hotel token = hotel type = word stemmedValues = hotel
o.getTerm = hotel token = hotel type = word stemmedValues = hotel
o.getTerm = hotel token = hotel type = word stemmedValues = hotel
query = +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^1.5 host:hotel^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 27
060629 155824 11 found resource common-terms.utf8 at file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
0 20060629052157/1b
1 20060629052157/2d
2 20060629052157/59
3 20060629052157/5a
4 20060629052157/86
5 20060629052157/89
6 20060629052157/8f
7 20060629052157/9a
8 20060629052157/aa
9 20060629052157/ab

060629 155851 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155851 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155851 10 opening merged index in C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155851 10 Plugins: looking in: C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155851 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155851 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155851 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.apache.nutch.parse.rating.RatingParser
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.parse.rating.RatingIndexer
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060629 155852 10 Added a rating query
o.getTerm = hotels token = hotel type = word stemmedValues = hotel
o.getTerm = hotels token = hotel type = word stemmedValues = hotel
o.getTerm = hotels token = hotel type = word stemmedValues = hotel
o.getTerm = hotels token = hotel type = word stemmedValues = hotel
query = +(url:hotels^4.0 anchor:hotels^2.0 content:hotels title:hotels^1.5 host:hotels^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 1
060629 155852 11 found resource common-terms.utf8 at file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
0 20060629052146/0
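
One thing that stands out in the logged queries is that both clauses carry a leading +, which in Lucene query syntax marks a clause as required. For hotels, that would mean a page has to match the unstemmed hotels clause and the stemmed hotel clause at the same time, which might explain the single hit. Below is a minimal sketch of that behaviour, not Nutch or Lucene code: the class RequiredClauseDemo, the method countHits and the four toy documents are all made up for illustration, and the per-field clauses are collapsed into a single term each.

```java
import java.util.List;
import java.util.Set;

// Hypothetical toy model: each document is just a set of terms, and a query
// is a list of terms that are ALL required, like two "+(...)" clauses.
public class RequiredClauseDemo {

    // Counts the documents that contain every required term.
    public static long countHits(List<Set<String>> docs, List<String> requiredTerms) {
        return docs.stream()
                   .filter(doc -> doc.containsAll(requiredTerms))
                   .count();
    }

    public static void main(String[] args) {
        // Made-up documents, not taken from the crawl in this post.
        List<Set<String>> docs = List.of(
            Set.of("hotel", "delhi"),
            Set.of("hotel", "booking"),
            Set.of("hotels", "hotel", "list"),
            Set.of("hotels", "cheap"));

        // Search for "hotel": original and stemmed clause are the same term.
        System.out.println(countHits(docs, List.of("hotel", "hotel")));   // prints 3

        // Search for "hotels": original clause AND stemmed clause both required.
        System.out.println(countHits(docs, List.of("hotels", "hotel")));  // prints 1

        // Stemmed clause alone, as if it had replaced the original clause.
        System.out.println(countHits(docs, List.of("hotel")));            // prints 3
    }
}
```

If this reading is right, the stemmed clause would need to replace the original clause, or be combined with it as an optional alternative, instead of being added as a second required clause. I have not verified this against the Nutch 0.7.2 source.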
