Hey,

I need stemming in my search engine, which is based on Nutch 0.7.2. The
stemming query is being created, but I am not getting appropriate results.
If I search for "hotel", I get 11 results, but if I search for "hotels", I
get only 1 result.

Any thoughts?

I have implemented stemming using the code from the mail by Howie Wang
dated 2005-06-22. It can be found at
http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html
among many other places.

To help others, I plan to add a Nutch category to my blog and share
what I learn there, once I have a decent grasp of how the various
parts of Nutch work.

Thanks and Regards,
Jayant Gandhi

--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
M.Tech. Computer Tech. Class of 2007,
IIT Delhi
060629 155824 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155824 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155824 10 opening merged index in C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155824 10 Plugins: looking in: C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.apache.nutch.parse.rating.RatingParser
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.parse.rating.RatingIndexer
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155824 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155824 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060629 155824 10 Added a rating query
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
query = +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^1.5 host:hotel^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 27
060629 155824 11 found resource common-terms.utf8 at file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
 0 20060629052157/1b
 1 20060629052157/2d
 2 20060629052157/59
 3 20060629052157/5a
 4 20060629052157/86
 5 20060629052157/89
 6 20060629052157/8f
 7 20060629052157/9a
 8 20060629052157/aa
 9 20060629052157/ab
060629 155851 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155851 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155851 10 opening merged index in C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155851 10 Plugins: looking in: C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155851 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155851 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155851 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.apache.nutch.parse.rating.RatingParser
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.parse.rating.RatingIndexer
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155852 10 not including: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155852 10 parsing: C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060629 155852 10 Added a rating query
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
query = +(url:hotels^4.0 anchor:hotels^2.0 content:hotels title:hotels^1.5 host:hotels^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 1
060629 155852 11 found resource common-terms.utf8 at file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
 0 20060629052146/0
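
One possible reading of the two logged queries: both are made of two
required ("+") groups, and for "hotels" the first group still asks for
the unstemmed term while only the second group uses the stem "hotel".
If both groups really are required, a document has to satisfy both, so
the stemmed group can only narrow the result set, never widen it. The
sketch below illustrates that with plain Java sets rather than Lucene
or Nutch classes. Everything in it is an assumption for illustration:
the class name, the three toy documents, and the simplification of the
url/anchor/content/title/host fields into one bag of terms per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RequiredClauseSketch {
    // Hypothetical toy index (not the real crawl): document id -> terms it contains.
    static final Map<String, Set<String>> INDEX = new LinkedHashMap<>();
    static {
        INDEX.put("doc1", new HashSet<>(Arrays.asList("hotel")));
        INDEX.put("doc2", new HashSet<>(Arrays.asList("hotel", "hotels")));
        INDEX.put("doc3", new HashSet<>(Arrays.asList("hotels")));
    }

    // Ids of the documents that contain the given term.
    static Set<String> matching(String term) {
        Set<String> ids = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : INDEX.entrySet()) {
            if (e.getValue().contains(term)) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Every term is required, like "+a +b": intersect the per-term matches.
    static Set<String> allRequired(String... terms) {
        Set<String> ids = new TreeSet<>(INDEX.keySet());
        for (String t : terms) {
            ids.retainAll(matching(t));
        }
        return ids;
    }

    // Any one term is enough, like "+(a b)": union of the per-term matches.
    static Set<String> anyOf(String... terms) {
        Set<String> ids = new TreeSet<>();
        for (String t : terms) {
            ids.addAll(matching(t));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(allRequired("hotel", "hotel"));   // shape of the "hotel" query
        System.out.println(allRequired("hotels", "hotel"));  // shape of the "hotels" query
        System.out.println(anyOf("hotels", "hotel"));        // original term OR its stem
    }
}
```

In this toy index, requiring both "hotels" and "hotel" leaves only the
one document that contains both forms, while treating the stem as an
alternative to the original term matches all three. Whether that is what
happens in the real index depends on which fields were stemmed at index
time, which the log alone does not show.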
