I just tried the following code in NutchDocumentAnalysis.java:
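public TokenStream tokenStream(String field, Reader reader) {
  // Tokenize and build common-term grams, as Nutch already does.
  TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
  // Lowercase before stemming; PorterStemFilter expects lowercase input.
  TokenStream ts2 = new LowerCaseFilter(ts);
  return new PorterStemFilter(ts2);
}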
It seems to work when indexing documents, but I run into problems
when searching. The stemmed terms are stored in the index, but the
search terms are not stemmed. So if "kittens" is stored as "kitten"
in the index, I have to search for "kitten" to get the document
returned. Searching for "kittens" doesn't yield any results.
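To make the mismatch concrete, here is a minimal self-contained sketch in plain Java, with no Lucene or Nutch classes. The class and method names are made up for illustration, and normalize() is only a toy stand-in for PorterStemFilter (it lowercases and strips one trailing "s"); it is not the Porter algorithm. It just shows that a lookup only succeeds when the query term goes through the same normalization as the indexed terms.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StemMismatchDemo {

    // Toy stand-in for a stemmer (assumption, NOT the Porter algorithm):
    // lowercases the token and strips a single trailing "s".
    public static String normalize(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Index-time analysis: split on non-letters, normalize every token.
    public static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        for (String tok : text.split("[^A-Za-z]+")) {
            if (!tok.isEmpty()) {
                terms.add(normalize(tok));
            }
        }
        return terms;
    }

    // Query-time lookup. stemQuery controls whether the query term gets
    // the same normalization that was applied at index time.
    public static boolean matches(Set<String> index, String queryTerm, boolean stemQuery) {
        String term = stemQuery ? normalize(queryTerm) : queryTerm.toLowerCase(Locale.ROOT);
        return index.contains(term);
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("Two kittens asleep");
        System.out.println(matches(index, "kittens", false)); // false: query not stemmed
        System.out.println(matches(index, "kitten", false));  // true: happens to equal the stem
        System.out.println(matches(index, "kittens", true));  // true: query stemmed like the index
    }
}
```

The first two lines of output mirror what I am seeing now; the third is the behavior I am after, which presumably means running the query text through the same filter chain.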
Any tips on how to stem the search query text?
Howie
Thanks for the info! Is the way to approach it just to call
PorterStemFilter in NutchDocumentAnalysis.java? Something
like this:
/** Analyzer used to index textual content. */
private static class ContentAnalyzer extends Analyzer {
  /** Constructs a {@link NutchDocumentTokenizer}. */
  public TokenStream tokenStream(String field, Reader reader) {
    TokenStream ts = CommonGrams.getFilter(new NutchDocumentTokenizer(reader), field);
    return new PorterStemFilter(ts);
  }
}
Am I completely off-base?
Howie
From: Andy Liu <[EMAIL PROTECTED]>
There are a couple that have been developed for Lucene. You'd have to
modify the Nutch code to use your new stemming analyzer.
On 6/8/05, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > It seems that stemming is not working for me in nutch. If a document
> > has the word "kittens" in it, when I search for "kitten" it is not
> > being returned. Is there something I need to do to enable or install
> > support for stemming in English?
>
> As far as I know, the Nutch Analyzer does not perform stemming.
> I planned to write a proposal for the next release on integrating
> multi-language analyzers in Nutch (like in Lucene).
> But for now, as far as I know, nothing has been done in this area.
>
> Jerome
>
> --
> http://motrech.free.fr/
> http://frutch.free.fr/