Hey,
I need stemming in my search engine, which is based on Nutch 0.7.2. The
stemmed query is being created, but I am not getting appropriate results.
If I search for "hotel", I get 11 results, but if I search for "hotels", I
get only 1 result. The log output for both searches is included below.
Any thoughts?
I have implemented stemming using the code from the mail by Howie Wang
dated 2005-06-22. It can be found at
http://www.nabble.com/RE%3A-Nutch-does-not-use-stemmers--p249520.html
among many other places.
To help others, I plan to add a Nutch category to my blog and share
what I learn, once I have a decent understanding of how the various
parts of Nutch work.
Thanks and Regards,
Jayant Gandhi
--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
M.Tech. Computer Tech. Class of 2007,
IIT Delhi

Log output for the search "hotel":
060629 155824 10 parsing
file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155824 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155824 10 opening merged index in
C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155824 10 Plugins: looking in:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.parse.HtmlParseFilter
class=org.apache.nutch.parse.rating.RatingParser
060629 155824 10 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.parse.rating.RatingIndexer
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155824 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155824 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155824 10 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
060629 155824 10 Added a rating query
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotel
token = hotel
type = word
stemmedValues = hotel
query = +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^1.5
host:hotel^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 27
060629 155824 11 found resource common-terms.utf8 at
file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
0 20060629052157/1b
1 20060629052157/2d
2 20060629052157/59
3 20060629052157/5a
4 20060629052157/86
5 20060629052157/89
6 20060629052157/8f
7 20060629052157/9a
8 20060629052157/aa
9 20060629052157/ab

Log output for the search "hotels":
060629 155851 10 parsing
file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-default.xml
060629 155851 10 parsing file:/C:/cygwin/home/HK/nutch-0.7.2/conf/nutch-site.xml
060629 155851 10 opening merged index in
C:\cygwin\home\HK\nutch-0.7.2\crawl.test5\index
060629 155851 10 Plugins: looking in:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins
060629 155851 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\clustering-carrot2
060629 155851 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\creativecommons
060629 155851 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\index-more
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\language-identifier
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\nutch-extensionpoints\plugin.xml
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\ontology
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-ext
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-html\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-js
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-msword
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-pdf
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-rss
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\parse-text\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-file
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-ftp
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-http\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\protocol-httpclient
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-basic\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-more
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-rating\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.parse.HtmlParseFilter
class=org.apache.nutch.parse.rating.RatingParser
060629 155852 10 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.parse.rating.RatingIndexer
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.parse.rating.RatingQueryFilter
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-site\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-stemmer\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.stemmer.StemmerQueryFilter
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\query-url\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
060629 155852 10 not including:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-prefix
060629 155852 10 parsing:
C:\cygwin\home\HK\nutch-0.7.2\build\plugins\urlfilter-regex\plugin.xml
060629 155852 10 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
060629 155852 10 Added a rating query
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
o.getTerm = hotels
token = hotel
type = word
stemmedValues = hotel
query = +(url:hotels^4.0 anchor:hotels^2.0 content:hotels title:hotels^1.5
host:hotels^2.0) +(url:hotel^4.0 anchor:hotel^2.0 content:hotel title:hotel^2.0)
Total hits: 1
060629 155852 11 found resource common-terms.utf8 at
file:/C:/cygwin/home/HK/nutch-0.7.2/conf/common-terms.utf8
0 20060629052146/0
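
One thing I notice when comparing the two queries above: both clauses carry a "+", so each one is required. For "hotels" that would mean a page has to match the original term "hotels" and also the stemmed term "hotel", which might explain the single hit. This is only my guess. The toy sketch below illustrates the effect I suspect. It uses plain Java collections (Java 7 or later, not the Lucene or Nutch API), the three documents are made up, searchAllRequired is a hypothetical helper, and it assumes the index holds the unstemmed terms.

```java
import java.util.*;

// Toy model of two required ("+") clauses; documents and their terms are made up.
class RequiredClauseDemo {
    // Hypothetical mini index: doc id -> terms, assumed to be stored unstemmed.
    static final Map<String, Set<String>> INDEX = new LinkedHashMap<>();
    static {
        INDEX.put("doc1", new HashSet<>(Arrays.asList("hotel", "delhi")));
        INDEX.put("doc2", new HashSet<>(Arrays.asList("hotels", "delhi")));
        INDEX.put("doc3", new HashSet<>(Arrays.asList("hotel", "hotels")));
    }

    // Every term is treated as a required clause: a doc must contain all of them.
    static List<String> searchAllRequired(String... requiredTerms) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : INDEX.entrySet()) {
            if (e.getValue().containsAll(Arrays.asList(requiredTerms))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Search "hotel": original clause and stemmed clause are the same term.
        System.out.println(searchAllRequired("hotel", "hotel"));
        // Search "hotels": original clause "hotels" AND stemmed clause "hotel".
        System.out.println(searchAllRequired("hotels", "hotel"));
    }
}
```

In this toy index the first search matches doc1 and doc3, while the second matches only doc3, the one page that happens to contain both forms. If that is what is happening in my index too, then I suppose either the original-term clause should not be required, or the index itself needs stemmed terms, but I have not verified this.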
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general