Re: Stemming with Nutch

Enis Soztutar Thu, 28 Jun 2007 07:31:29 -0700


Doğacan Güney wrote:

On 6/28/07, Robert Young <[EMAIL PROTECTED]> wrote:

Hi,

Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issues.apache.org


There is some sort of stemming for German and French (available as
the analysis-de and analysis-fr plugins). I don't know how well they
work (or if they work). AFAIK, there is no support for stemming
English.

There is a PorterStemmer in Lucene, but it is not used in Nutch. You can
easily add it by overriding NutchDocumentAnalyzer.
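
A rough sketch of that idea, written in plain Java so it runs without Lucene or Nutch on the classpath: a subclass reuses the base tokenization and passes every term through a stemmer. SimpleAnalyzer, BaseAnalyzer, StemmingAnalyzer and toyStem are hypothetical stand-ins, not Nutch or Lucene classes, and toyStem is a crude suffix stripper, not the Porter algorithm. In a real override the stemming step would presumably be Lucene's PorterStemFilter wrapped around the token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class StemmingAnalyzerSketch {

    /** Hypothetical stand-in for an analyzer: field text in, index terms out. */
    interface SimpleAnalyzer {
        List<String> analyze(String field, String text);
    }

    /** Stand-in for the default analyzer: lowercase, split on non-alphanumerics. */
    static class BaseAnalyzer implements SimpleAnalyzer {
        @Override
        public List<String> analyze(String field, String text) {
            List<String> terms = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!t.isEmpty()) {
                    terms.add(t);
                }
            }
            return terms;
        }
    }

    /** The override: same tokenization as the base class, plus a stemming step per term. */
    static class StemmingAnalyzer extends BaseAnalyzer {
        private final UnaryOperator<String> stemmer;

        StemmingAnalyzer(UnaryOperator<String> stemmer) {
            this.stemmer = stemmer;
        }

        @Override
        public List<String> analyze(String field, String text) {
            List<String> stemmed = new ArrayList<>();
            for (String t : super.analyze(field, text)) {
                stemmed.add(stemmer.apply(t));
            }
            return stemmed;
        }
    }

    /** Toy suffix stripper for illustration only; NOT the Porter algorithm. */
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        SimpleAnalyzer analyzer = new StemmingAnalyzer(StemmingAnalyzerSketch::toyStem);
        // prints [index, plugin]
        System.out.println(analyzer.analyze("content", "Indexing plugins"));
    }
}
```

The stemmer is passed in as a function so that the toy version can be swapped for a real one without touching the analyzer itself.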


Btw, I think we should revise Nutch's document analysis system. For
example, analyzers for index-basic's fields are hard-coded in the analysis
package (what happens if I don't use index-basic and use my own
index-mind-blowingly-awesome plugin?). You either have to use all of
it or completely override it and use none of it. We should allow index
plugins to specify their analyzers per field. There are analysis-*
plugins, but they only work for documents in specific languages (what if I
don't want to use language identification? what if Nutch can't figure
out what the language is?).

I strongly agree. Index-* plugins and analysis-* plugins are
cross-dependent. For every new field added by the indexing plugins, ALL
the analysis plugins have to be changed to analyze this new field, which
breaks the golden rule. I agree with the idea that index plugins should
specify their analyzers.
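
To make the proposal concrete, here is a minimal sketch of a per-field analyzer registry in plain Java. Every name in it (FieldAnalyzer, AnalyzerRegistry, register, the "url" and "title" fields) is a hypothetical illustration and not an existing Nutch extension point. The idea is that each index plugin registers an analyzer for the fields it adds, and fields nobody claimed fall back to a default. Lucene's PerFieldAnalyzerWrapper offers a similar per-field dispatch on the Lucene side.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class PerFieldAnalysisSketch {

    /** Hypothetical analyzer contract: raw field text in, index terms out. */
    interface FieldAnalyzer {
        List<String> analyze(String text);
    }

    /** Lowercases and splits on whitespace. */
    static final FieldAnalyzer LOWERCASE_WORDS = text -> {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    };

    /** Keeps the whole value as one untouched term. */
    static final FieldAnalyzer KEYWORD = text -> Collections.singletonList(text);

    /** Index plugins would call register() for each field they add. */
    static class AnalyzerRegistry {
        private final Map<String, FieldAnalyzer> byField = new HashMap<>();
        private final FieldAnalyzer fallback;

        AnalyzerRegistry(FieldAnalyzer fallback) {
            this.fallback = fallback;
        }

        void register(String field, FieldAnalyzer analyzer) {
            // Two plugins claiming the same field is treated as a configuration error.
            if (byField.putIfAbsent(field, analyzer) != null) {
                throw new IllegalStateException("analyzer already registered for field: " + field);
            }
        }

        List<String> analyze(String field, String text) {
            return byField.getOrDefault(field, fallback).analyze(text);
        }
    }

    public static void main(String[] args) {
        AnalyzerRegistry registry = new AnalyzerRegistry(LOWERCASE_WORDS);
        registry.register("url", KEYWORD); // e.g. done by the plugin that adds "url"
        System.out.println(registry.analyze("url", "http://Example.org/A B"));
        System.out.println(registry.analyze("title", "Stemming With Nutch"));
    }
}
```

With this shape, adding a new field touches only the plugin that introduces it, and the analysis-* plugins no longer need to know about every field.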


Index plugins should also be able to control how stuff like their fields'
length norm is calculated (which is currently hard-coded too and can't
be changed).
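
As an illustration of what that control could look like, the sketch below lets a plugin override the length norm for its own field and keeps 1/sqrt(numTerms), the formula Lucene's default similarity uses, for every other field. FieldNormSketch, setNorm and the "anchor" field are hypothetical names for this example, not an existing Nutch API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

class FieldNormSketch {

    /** Default norm: 1/sqrt(numTerms), so longer fields weigh less per term. */
    static final IntToDoubleFunction DEFAULT_NORM = numTerms -> 1.0 / Math.sqrt(numTerms);

    private final Map<String, IntToDoubleFunction> normsByField = new HashMap<>();

    /** Hypothetical hook an index plugin would call for a field it owns. */
    void setNorm(String field, IntToDoubleFunction norm) {
        normsByField.put(field, norm);
    }

    /** Uses the plugin-supplied norm if there is one, the default otherwise. */
    double lengthNorm(String field, int numTerms) {
        return normsByField.getOrDefault(field, DEFAULT_NORM).applyAsDouble(numTerms);
    }

    public static void main(String[] args) {
        FieldNormSketch norms = new FieldNormSketch();
        norms.setNorm("anchor", numTerms -> 1.0); // assumption: do not penalize long anchor text
        System.out.println(norms.lengthNorm("content", 4));  // prints 0.5
        System.out.println(norms.lengthNorm("anchor", 100)); // prints 1.0
    }
}
```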

Oh and, if you are feeling up to it, any help in this area would be
much appreciated :).


Thanks
Rob
