Hi,

You could use sun.text.Normalizer (see
http://www.rgagnon.com/javadetails/java-0456.html). It may also be worth
checking first what language the text is written in, before applying the filter.
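
For example, a minimal sketch (the article above uses the unsupported
sun.text.Normalizer; this version uses the public java.text.Normalizer that
ships with Java 6 and later, but the idea is the same: decompose each accented
character and then drop the combining marks):

import java.text.Normalizer;

public class AccentStripper {

    /** Decomposes accented characters into base character + combining mark,
     *  then removes the marks, so "é" and "è" both become "e". */
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("élève, crème brûlée"));  // eleve, creme brulee
    }
}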

Thibaut
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thu 3/24/2005 9:08 PM
To: nutch-dev@incubator.apache.org
Subject: Re: International Parser
 
Christophe Noel wrote:
> I need to develop a "French" parser. Google indexes French documents 
> by normalizing "é" (HTML: &eacute;) and "è" characters to "e". I think 
> there is already a French parser for Lucene, so this is not really a problem.

Would it be a problem to simply make this conversion for all languages? 
Does Google distinguish between "é", "è" and "e" for other languages?

> Problem is : can it be created as a nutch plugin ?

It is a little complicated to add language-specific tokenization, since 
Nutch's tokenizer is currently defined together with its query parser, 
and each plugin should not have to rewrite the query parser, as it is 
rather complex.

A good way to handle this might be to rewrite the query parser so that 
it uses a language-specific tokenizer as input.  Each plugin would 
define a tokenizer.  Plugins would be selected by language, with a 
configuration-defined default.  Most implementations would probably 
simply apply a token filter to the output of a standard tokenizer 
implementation.  The tokenizer must always split tokens at query syntax 
characters.  The query parser must then declare a list of query syntax 
characters.
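
For instance, a French plugin might just wrap the standard tokenizer in an
accent-stripping filter.  Here is a rough sketch against the Lucene
TokenFilter/Token API of that era (the class name AccentStrippingFilter and
the small character table are illustrative, not existing code):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Maps accented characters in each token to their unaccented equivalents,
 *  so "élève" is indexed as "eleve". */
public class AccentStrippingFilter extends TokenFilter {

  public AccentStrippingFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token = input.next();
    if (token == null)
      return null;
    String stripped = strip(token.termText());
    if (stripped.equals(token.termText()))
      return token;                                   // nothing to change
    return new Token(stripped, token.startOffset(), token.endOffset(),
                     token.type());
  }

  // Minimal table for French; a real filter would cover all of Latin-1.
  private static String strip(String s) {
    StringBuffer buf = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case 'à': case 'â':                     buf.append('a'); break;
        case 'é': case 'è': case 'ê': case 'ë': buf.append('e'); break;
        case 'î': case 'ï':                     buf.append('i'); break;
        case 'ô':                               buf.append('o'); break;
        case 'ù': case 'û': case 'ü':           buf.append('u'); break;
        case 'ç':                               buf.append('c'); break;
        default:                                buf.append(c);
      }
    }
    return buf.toString();
  }
}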

Each plugin should also define a stop list.  In Nutch, stop lists are 
not used at index time, but are only applied by the query parser to 
terms that are neither in a phrase nor explicitly required.

So the API might look something like:

/** Factory to get plugin implementation. */
public class LanguageAnalyzerFactory {
   public static LanguageAnalyzer getAnalyzer(String language);
}

/** Implemented by plugins. */
public interface LanguageAnalyzer {
   TokenStream getTokenStream(Reader reader);
   boolean isStopWord(String term);
}

/** A default implementation.  Most LanguageAnalyzer plugins will apply
 *  a filter to this. */
public class NutchTokenizer extends TokenStream {
   // returns the same strings as existing NutchAnalysis.term()
   public Token next();
}
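
To make the shape concrete, a French plugin built on this API might look
roughly like the following (FrenchAnalyzer, the stop list, the
AccentStrippingFilter above and the NutchTokenizer(Reader) constructor are all
assumptions for illustration, not existing code):

import java.io.Reader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;

/** Illustrative French plugin: filters the default tokenizer's output and
 *  supplies a stop list for the query parser to consult. */
public class FrenchAnalyzer implements LanguageAnalyzer {

  private static final Set STOP_WORDS = new HashSet(Arrays.asList(
      new String[] { "le", "la", "les", "de", "des", "du", "un", "une", "et" }));

  public TokenStream getTokenStream(Reader reader) {
    // wrap the default tokenizer with the accent-stripping filter
    return new AccentStrippingFilter(new NutchTokenizer(reader));
  }

  public boolean isStopWord(String term) {
    // consulted by the query parser only for terms that are neither in a
    // phrase nor explicitly required
    return STOP_WORDS.contains(term);
  }
}

The query parser would then obtain the analyzer for the relevant language via
LanguageAnalyzerFactory.getAnalyzer("fr") and read tokens from the stream it
returns.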

Does this sound like the best approach?  Is anyone willing to try to 
implement this?  It requires JavaCC hacking...

Doug