I need to develop a "French" parser. Google indexes French documents by mapping the "é" (HTML: &eacute;) and "è" characters to "e". I think there is already a French parser for Lucene, so this is not really a problem.
Would it be a problem to simply make this conversion for all languages? Does Google distinguish between "é", "è" and "e" for other languages?
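For reference, the conversion under discussion can be sketched in a few lines with the JDK's `java.text.Normalizer`. This is only a minimal sketch, not anything that exists in Nutch or Lucene: the class name `AccentFolder` and the method `fold` are made up for illustration. It decomposes each character and drops the combining marks, so it folds every accented letter, which is exactly why the question of other languages matters (in some languages an accented letter is a distinct letter rather than a variant).

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Nutch or Lucene.
class AccentFolder {
    /** Maps accented letters to their base letter, e.g. "\u00e9" and "\u00e8" to "e". */
    static String fold(String text) {
        // NFD splits "\u00e9" into "e" followed by a combining acute accent (U+0301).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Calling `AccentFolder.fold("\u00e9l\u00e8ve")` returns `"eleve"`; unaccented input comes back unchanged.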
The problem is: can it be created as a Nutch plugin?
It is a little complicated to add language-specific tokenization, since Nutch's tokenizer is currently defined together with its query parser, and plugins should not each have to rewrite the query parser, which is rather complex.
A good way to handle this might be to rewrite the query parser so that it uses a language-specific tokenizer as input. Each plugin would define a tokenizer. Plugins would be selected by language, with a configuration-defined default. Most implementations would probably simply apply a token filter to the output of a standard tokenizer implementation. The tokenizer must always split tokens at query syntax characters. The query parser must then declare a list of query syntax characters.
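The selection scheme described above could be sketched roughly as follows. This is a self-contained illustration with no Lucene or Nutch types: the class `LanguageTokenizers`, its methods, and the contents of `QUERY_SYNTAX_CHARS` are all assumptions made for the example, and the "standard" tokenizer here is a simplification that splits only at whitespace and syntax characters, which is not what `NutchAnalysis.term()` actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical registry: one token filter per language, with a configured default.
class LanguageTokenizers {
    /** Assumed list that the query parser would declare; the real set may differ. */
    static final String QUERY_SYNTAX_CHARS = "\"+-:";

    private final Map<String, UnaryOperator<String>> filters = new HashMap<>();
    private final String defaultLanguage;

    LanguageTokenizers(String defaultLanguage) {
        this.defaultLanguage = defaultLanguage;
        filters.put(defaultLanguage, token -> token); // default: no extra filtering
    }

    /** A plugin would call this to contribute its filter for one language. */
    void register(String language, UnaryOperator<String> filter) {
        filters.put(language, filter);
    }

    /** Simplified standard tokenizer: always splits at query syntax characters. */
    static List<String> standardTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || QUERY_SYNTAX_CHARS.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(Character.toLowerCase(c));
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Applies the language's filter to the standard tokens, falling back to the default. */
    List<String> tokenize(String language, String text) {
        UnaryOperator<String> filter =
            filters.getOrDefault(language, filters.get(defaultLanguage));
        List<String> out = new ArrayList<>();
        for (String token : standardTokens(text)) {
            out.add(filter.apply(token));
        }
        return out;
    }
}
```

A French plugin would then register something like `tok -> tok.replace('\u00e9', 'e').replace('\u00e8', 'e')` under `"fr"`, and any language without a registered filter falls through to the default. Because the split at syntax characters lives in the shared tokenizer, no plugin can accidentally swallow a quote or a `+`.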
Each plugin should also define a stop list. In Nutch, stop lists are not used at index time; they are applied only by the query parser, and only to terms that are neither in a phrase nor explicitly required.
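The stop-list rule can be illustrated with a small sketch. The `Term` type and `applyStopList` method below are hypothetical stand-ins, not the query parser's real data structures; they only model the two flags that decide whether a stop word survives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of the query-time stop-word rule.
class StopWordPolicy {
    /** A parsed query term, reduced to the flags that matter for stop-word removal. */
    static final class Term {
        final String text;
        final boolean inPhrase;  // term appeared inside a quoted phrase
        final boolean required;  // term was explicitly required by the user

        Term(String text, boolean inPhrase, boolean required) {
            this.text = text;
            this.inPhrase = inPhrase;
            this.required = required;
        }
    }

    /** Drops stop words, except those in a phrase or explicitly required. */
    static List<Term> applyStopList(List<Term> terms, Set<String> stopWords) {
        List<Term> kept = new ArrayList<>();
        for (Term term : terms) {
            boolean exempt = term.inPhrase || term.required;
            if (exempt || !stopWords.contains(term.text)) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With a French stop list containing "le" and "la", a bare "le" is dropped, while a required "le" or a "la" inside a quoted phrase is kept.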
So the API might look something like:
/** Factory to get plugin implementation. */
public class LanguageAnalyzerFactory {
  public static Analyzer getAnalyzer(String language);
}
/** Implemented by plugins. */
public interface LanguageAnalyzer {
  TokenStream getTokenStream(Reader reader);
  boolean isStopWord(String term);
}
/** A default implementation. Most LanguageAnalyzer plugins will apply a filter to this. */
public class NutchTokenizer extends TokenStream {
// returns the same strings as existing NutchAnalysis.term()
public Token next();
}
Does this sound like the best approach? Is anyone willing to try to implement this? It requires JavaCC hacking...
Doug