Doğacan Güney wrote:
On 6/28/07, Robert Young <[EMAIL PROTECTED]> wrote:
Hi,

Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issues.apache.org.

There is some sort of stemming for German and French languages
(available as plugin analysis-de and analysis-fr). I don't know how
well they work (or if they work). AFAIK, there is no support for
stemming English.
Lucene has a PorterStemmer, but Nutch does not use it. You can add it fairly easily by overriding NutchDocumentAnalyzer.


Btw, I think we should revise nutch's document analysis system. For
example, analyzers for index-basic's fields are hard-coded in the analysis
package (what happens if I don't use index-basic and use my own
index-mind-blowingly-awesome plugin?). You either have to use all of
it or completely override it and use none of it. We should allow index
plugins to specify their analyzers per field. There are analysis-*
plugins but they work for documents in specific languages (what if I
don't want to use language identification? what if nutch can't figure
out what the language is?)

I strongly agree. Index-* plugins and analysis-* plugins are cross-dependent. For every new field added by an indexing plugin, ALL the analysis plugins have to be changed to analyze that field, which breaks the golden rule. I agree with the idea that index plugins should specify their own analyzers.
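
To make the proposal concrete, here is a rough sketch of what "index plugins specify their analyzers per field" could look like. Everything in it is hypothetical: FieldAnalyzerRegistry, register, lowercaseWords and keyword are made-up names, not Nutch or Lucene API, and a plain function from text to tokens stands in for a real Lucene Analyzer so the example runs on its own. (Lucene's PerFieldAnalyzerWrapper is built around a similar idea, if I remember correctly.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-field analyzer registry; NOT part of Nutch or Lucene. */
public class FieldAnalyzerRegistry {
    // Field name -> analyzer. A function from text to tokens stands in for a Lucene Analyzer.
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    // Used for any field that no plugin registered an analyzer for.
    private final Function<String, List<String>> fallback;

    public FieldAnalyzerRegistry(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    /** An index plugin would call this once for each field it adds. */
    public void register(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    /** Analyze text with the field's own analyzer, or the fallback if none is registered. */
    public List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    /** Toy analyzer: lowercase, then split on runs of non-letter, non-digit characters. */
    public static List<String> lowercaseWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Toy analyzer: keep the whole value as a single token. */
    public static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        FieldAnalyzerRegistry registry =
            new FieldAnalyzerRegistry(FieldAnalyzerRegistry::lowercaseWords);
        // A hypothetical index plugin declares how its own "site" field is analyzed.
        registry.register("site", FieldAnalyzerRegistry::keyword);

        System.out.println(registry.analyze("site", "lucene.apache.org"));
        System.out.println(registry.analyze("content", "Nutch Stemming, revisited"));
    }
}
```

With something like this, adding a new field would only touch the plugin that adds it, and fields nobody registered would still get the default analysis.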

Index plugins should also be able to control how things like their fields'
length norms are calculated (which is currently hard-coded too and can't
be changed).
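
The same idea could be applied to length norms. The sketch below is only an illustration under assumptions: FieldNorms and its methods are hypothetical names, not existing Nutch or Lucene API. The one borrowed detail is the fallback formula, 1/sqrt(number of terms), which is what Lucene's default similarity uses for its length norm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Hypothetical per-field length-norm registry; NOT part of Nutch or Lucene. */
public class FieldNorms {
    // Field name -> function from the field's token count to its norm.
    private final Map<String, IntToDoubleFunction> byField = new HashMap<>();

    /** An index plugin would call this to override the norm for one of its fields. */
    public void register(String field, IntToDoubleFunction norm) {
        byField.put(field, norm);
    }

    /** Norm for a field with numTokens tokens; falls back to 1/sqrt(numTokens). */
    public double lengthNorm(String field, int numTokens) {
        IntToDoubleFunction norm = byField.get(field);
        if (norm != null) {
            return norm.applyAsDouble(numTokens);
        }
        // Guard the empty-field case so the fallback never divides by zero.
        return numTokens <= 0 ? 0.0 : 1.0 / Math.sqrt(numTokens);
    }

    public static void main(String[] args) {
        FieldNorms norms = new FieldNorms();
        // A hypothetical plugin decides that length should not matter for its "anchor" field.
        norms.register("anchor", n -> 1.0);

        System.out.println(norms.lengthNorm("content", 4));
        System.out.println(norms.lengthNorm("anchor", 4));
    }
}
```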

Oh and, if you are feeling up to it, any help in this area would be
much appreciated :).


Thanks
Rob


