Christophe Noel wrote:
I need to develop a "French" parser. Google indexes French documents by converting the "é" (HTML: &eacute;) and "è" characters to "e". I think there is already a French parser for Lucene, so this is not really a problem.

Would it be a problem to simply make this conversion for all languages? Does Google distinguish between "é", "è" and "e" for other languages?


The problem is: can it be created as a Nutch plugin?

It is a little complicated to add language-specific tokenization, since Nutch's tokenizer is currently defined together with its query parser, and each plugin should not have to rewrite the query parser, as it is rather complex.


A good way to handle this might be to rewrite the query parser so that it uses a language-specific tokenizer as input. Each plugin would define a tokenizer. Plugins would be selected by language, with a configuration-defined default. Most implementations would probably simply apply a token filter to the output of a standard tokenizer implementation. The tokenizer must always split tokens at query syntax characters. The query parser must then declare a list of query syntax characters.
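
To make the "token filter on top of a standard tokenizer" idea concrete, here is a rough sketch in plain Java. It does not use the Lucene or Nutch classes: the class name, the method names, and the list of query syntax characters are all assumptions made for illustration. It splits at whitespace and at the declared syntax characters, then folds accents so that "é" and "è" become "e". A real tokenizer would presumably still hand the syntax characters to the query parser rather than discard them as this sketch does.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AccentFoldingSketch {
    // Hypothetical list of query syntax characters; the real query parser
    // would declare its own.
    static final String QUERY_SYNTAX_CHARS = "\"+-:";

    /** Decomposes accented letters (NFD) and strips the combining marks. */
    static String foldAccents(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    /** Splits at whitespace and query syntax characters, then filters each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || QUERY_SYNTAX_CHARS.indexOf(c) >= 0) {
                flush(current, tokens);
            } else {
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    /** Emits the pending token, lowercased and accent-folded, if there is one. */
    private static void flush(StringBuilder current, List<String> tokens) {
        if (current.length() > 0) {
            tokens.add(foldAccents(current.toString().toLowerCase(Locale.ROOT)));
            current.setLength(0);
        }
    }
}
```

With these assumptions, tokenizing the query text `+École "très bien"` yields the tokens ecole, tres, bien. Note that treating "-" as a syntax character also splits hyphenated words, which a real implementation may or may not want.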

Each plugin should also define a stop list. In Nutch, stop lists are not used at index time, but are applied only by the query parser, to terms that are neither in a phrase nor explicitly required.
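
The stop list rule can be sketched as follows. This is only an illustration in plain Java, not the existing query parser code: the `QueryTerm` class, its flags, and the sample French stop words are assumptions. A term is dropped only if it is a stop word and is neither required nor part of a phrase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopListSketch {
    /** A query term plus the two flags the stop list rule depends on (hypothetical). */
    static final class QueryTerm {
        final String text;
        final boolean required;
        final boolean inPhrase;

        QueryTerm(String text, boolean required, boolean inPhrase) {
            this.text = text;
            this.required = required;
            this.inPhrase = inPhrase;
        }
    }

    // Illustrative stop words only; each language plugin would supply its own list.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("le", "la", "les", "de", "et"));

    static boolean isStopWord(String term) {
        return STOP_WORDS.contains(term);
    }

    /** Keeps every term except stop words that are neither required nor in a phrase. */
    static List<String> applyStopList(List<QueryTerm> terms) {
        List<String> kept = new ArrayList<>();
        for (QueryTerm t : terms) {
            boolean exempt = t.required || t.inPhrase;
            if (exempt || !isStopWord(t.text)) {
                kept.add(t.text);
            }
        }
        return kept;
    }
}
```

For example, given "le" (plain), "chat" (plain), "de" (in a phrase) and "la" (required), only "le" is removed.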

So the API might look something like:

/** Factory to get plugin implementation. */
public class LanguageAnalyzerFactory {
  public static LanguageAnalyzer getAnalyzer(String language);
}

/** Implemented by plugins. */
public interface LanguageAnalyzer {
  TokenStream getTokenStream(Reader reader);
  boolean isStopWord(String term);
}

/** A default implementation. Most LanguageAnalyzer plugins will apply a filter to this. */
public class NutchTokenizer extends TokenStream {
  // returns the same strings as existing NutchAnalysis.term()
  public Token next();
}
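
The factory's selection by language, with a configuration-defined default, might work roughly like this. It is a self-contained sketch, not the Nutch plugin mechanism: the registry class, the simplified one-method analyzer interface, and the use of lowercase language codes as keys are all assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class AnalyzerRegistrySketch {
    /** Simplified stand-in for the plugin interface (hypothetical). */
    interface SimpleLanguageAnalyzer {
        boolean isStopWord(String term);
    }

    private final Map<String, SimpleLanguageAnalyzer> byLanguage = new HashMap<>();
    private final SimpleLanguageAnalyzer defaultAnalyzer;

    /** The default would come from configuration in a real setup. */
    AnalyzerRegistrySketch(SimpleLanguageAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers a plugin under a language code such as "fr". */
    void register(String language, SimpleLanguageAnalyzer analyzer) {
        byLanguage.put(language.toLowerCase(Locale.ROOT), analyzer);
    }

    /** Returns the analyzer for the language, or the default if none is registered. */
    SimpleLanguageAnalyzer getAnalyzer(String language) {
        if (language == null) {
            return defaultAnalyzer;
        }
        SimpleLanguageAnalyzer found = byLanguage.get(language.toLowerCase(Locale.ROOT));
        return found != null ? found : defaultAnalyzer;
    }
}
```

An unknown or missing language falls back to the default, so documents in languages without a plugin still get the standard behaviour.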


Does this sound like the best approach? Is anyone willing to try to implement this? It requires JavaCC hacking...

Doug



