Hi,

You could use sun.text.Normalizer (see http://www.rgagnon.com/javadetails/java-0456.html). Maybe you should also check what language the text is written in before applying the filter.
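
A minimal sketch of the accent folding discussed in this thread. It assumes the public java.text.Normalizer class (part of the JDK since Java 6) rather than the internal sun.text.Normalizer named above. The class name AccentFolder and the method name fold are hypothetical, chosen only for illustration.

```java
import java.text.Normalizer;

public class AccentFolder {

    /**
     * Decomposes each accented character into a base letter plus
     * combining marks (Unicode form NFD), then deletes the marks.
     * Assumption: java.text.Normalizer stands in for sun.text.Normalizer.
     */
    public static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute and "\u00e8" is e-grave.
        System.out.println(fold("\u00e9t\u00e9 tr\u00e8s")); // prints "ete tres"
    }
}
```

This folds every accent regardless of language, so a language check, if wanted, would have to happen before fold is called.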

Thibaut

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thu 3/24/2005 9:08 PM
To: nutch-dev@incubator.apache.org
Subject: Re: International Parser

Christophe Noel wrote:
> I need to develop a "french" parser. Google index french documents
> parsing "é" (HTML : e´) and "è" characters to "e". I think there
> is already a french parser for Lucene, so this is not really a problem.

Would it be a problem to simply make this conversion for all languages? Does Google distinguish between "é", "è" and "e" for other languages?

> Problem is : can it be created as a nutch plugin ?

It is a little complicated to add language-specific tokenization, since Nutch's tokenizer is currently defined together with its query parser, and each plugin should not have to rewrite the query parser, as it is rather complex.

A good way to handle this might be to rewrite the query parser so that it uses a language-specific tokenizer as input. Each plugin would define a tokenizer. Plugins would be selected by language, with a configuration-defined default. Most implementations would probably simply apply a token filter to the output of a standard tokenizer implementation.

The tokenizer must always split tokens at query syntax characters. The query parser must then declare a list of query syntax characters.

Each plugin should also define a stop list. In Nutch, stop lists are not used at index time, but rather only applied by the query parser to terms that are neither in a phrase nor explicitly required.

So the API might look something like:

    /** Factory to get plugin implementation. */
    public class LanguageAnalyzerFactory {
      public static LanguageAnalyzer getAnalyzer(String language);
    }

    /** Implemented by plugins. */
    public interface LanguageAnalyzer {
      TokenStream getTokenStream(Reader reader);
      boolean isStopWord(String term);
    }

    /** A default implementation.  Most LanguageAnalyzer plugins will
     * apply a filter to this. */
    public class NutchTokenizer extends TokenStream {
      // returns the same strings as existing NutchAnalysis.term()
      public Token next();
    }

Does this sound like the best approach? Is anyone willing to try to implement this? It requires JavaCC hacking...

Doug
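
The plugin design outlined above could be sketched roughly as follows. This is only an illustration under several assumptions: it has no Lucene or Nutch dependency, so a plain List<String> stands in for TokenStream; the names DefaultAnalyzer, FrenchAnalyzer, tokenize and terms are hypothetical; the five-word French stop list is a placeholder; and java.text.Normalizer (Java 6 and later) is assumed for the accent filter. It does not model the query parser rule that stop words are kept inside phrases and for explicitly required terms.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LanguageAnalyzerSketch {

    /** Stand-in for the proposed plugin interface (List replaces TokenStream). */
    interface LanguageAnalyzer {
        List<String> tokenize(Reader reader) throws IOException;
        boolean isStopWord(String term);
    }

    /** Hypothetical default: lower-cased runs of letters and digits. Splitting on
     *  every other character also splits at any query syntax character. */
    static class DefaultAnalyzer implements LanguageAnalyzer {
        public List<String> tokenize(Reader reader) throws IOException {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isLetterOrDigit(c)) {
                    current.append(Character.toLowerCase((char) c));
                } else if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }

        public boolean isStopWord(String term) {
            return false; // the default defines no stop list
        }
    }

    /** Hypothetical French plugin: an accent-folding filter over the default tokens. */
    static class FrenchAnalyzer implements LanguageAnalyzer {
        // Placeholder stop list, not a real French stop list.
        private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("le", "la", "les", "de", "et"));
        private final LanguageAnalyzer base = new DefaultAnalyzer();

        public List<String> tokenize(Reader reader) throws IOException {
            List<String> folded = new ArrayList<>();
            for (String token : base.tokenize(reader)) {
                String nfd = Normalizer.normalize(token, Normalizer.Form.NFD);
                folded.add(nfd.replaceAll("\\p{M}+", "")); // drop combining marks
            }
            return folded;
        }

        public boolean isStopWord(String term) {
            return STOP_WORDS.contains(term);
        }
    }

    /** Selects a plugin by language code and falls back to the default. */
    static class LanguageAnalyzerFactory {
        private static final Map<String, LanguageAnalyzer> PLUGINS = new HashMap<>();
        private static final LanguageAnalyzer DEFAULT = new DefaultAnalyzer();
        static {
            PLUGINS.put("fr", new FrenchAnalyzer());
        }

        static LanguageAnalyzer getAnalyzer(String language) {
            return PLUGINS.getOrDefault(language, DEFAULT);
        }
    }

    /** Convenience helper: tokenizes text and removes the analyzer's stop words. */
    static List<String> terms(String language, String text) {
        LanguageAnalyzer analyzer = LanguageAnalyzerFactory.getAnalyzer(language);
        List<String> kept = new ArrayList<>();
        try {
            for (String term : analyzer.tokenize(new StringReader(text))) {
                if (!analyzer.isStopWord(term)) {
                    kept.add(term);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute and "\u00e8" is e-grave.
        System.out.println(terms("fr", "Le caf\u00e9 tr\u00e8s cher")); // prints [cafe, tres, cher]
    }
}
```

An unknown language code falls through to DefaultAnalyzer, which mirrors the configuration-defined default mentioned in the message; that analyzer keeps accents and removes no stop words.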