RE: [Lucene-dev] I18N issue with Lucene...

Doug Cutting Wed, 11 Jul 2001 08:56:28 -0700
> From: David Li [mailto:[EMAIL PROTECTED]]
> 
>    The question would be where this segmentor should be. 

This should be an implementation of com.lucene.analysis.Tokenizer.

The design of Lucene is that conversion from markup languages to plain text
should be done before a Document object is created.  In other words, text
fields in a Document object should contain only text, not markup.  Then,
implementations of Tokenizer should break this text into words.
Implementations of TokenFilter may subsequently process these words.  An
Analyzer creates a Tokenizer, and optionally some TokenFilters.  So
Tokenizer is the best place for Chinese segmentation, and you'll also need
to define an Analyzer that uses your Tokenizer.
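
As a rough illustration of this layering, here is a self-contained sketch
that uses only the JDK.  SimpleTokenStream, WhitespaceSplitter, LowerCaser
and SketchAnalyzer are hypothetical stand-ins invented for this example,
not Lucene classes; only the shape (an analyzer creates a tokenizer and
optionally wraps it in filters) is taken from the description above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a token stream: next() returns null when done.
interface SimpleTokenStream {
  String next() throws IOException;
}

// Plays the Tokenizer role: breaks plain text from a Reader into words.
class WhitespaceSplitter implements SimpleTokenStream {
  private final Reader input;

  WhitespaceSplitter(Reader input) {
    this.input = input;
  }

  public String next() throws IOException {
    StringBuilder word = new StringBuilder();
    for (int c = input.read(); c != -1; c = input.read()) {
      if (!Character.isWhitespace((char) c)) {
        word.append((char) c);
      } else if (word.length() > 0) {
        break;  // whitespace after a word ends the token
      }
    }
    return word.length() > 0 ? word.toString() : null;
  }
}

// Plays the TokenFilter role: post-processes words from another stream.
class LowerCaser implements SimpleTokenStream {
  private final SimpleTokenStream in;

  LowerCaser(SimpleTokenStream in) {
    this.in = in;
  }

  public String next() throws IOException {
    String word = in.next();
    return word == null ? null : word.toLowerCase(Locale.ROOT);
  }
}

// Plays the Analyzer role: creates the tokenizer and chains the filters.
class SketchAnalyzer {
  static SimpleTokenStream tokenStream(Reader reader) {
    return new LowerCaser(new WhitespaceSplitter(reader));
  }

  // Convenience helper for the example: collect all tokens from a String.
  static List<String> analyze(String text) {
    List<String> words = new ArrayList<>();
    try {
      SimpleTokenStream stream = tokenStream(new StringReader(text));
      for (String w = stream.next(); w != null; w = stream.next()) {
        words.add(w);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return words;
  }
}
```

A Chinese analyzer would follow the same pattern, with the segmentor in
the place of WhitespaceSplitter.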


>    Functionally, the BreakIterator is very closely related to the 
> segmentor. We are looking into the possibility to integrate our 
> segmentor into the BreakIterator framework.
> 
>    If we have Lucene to use ICU4J at the bottom, we could get Lucene to
> become a search engine that is capable handle as many languages as
> supported by ICU4J.
> 
> I'd like to know what the group think of this idea.

I think that's a great idea.  BreakIterator does not operate on a Reader,
but on a String, so a BreakIterator-based Tokenizer implementation would
have to first convert the text from a Reader into a String.  This might
look something like:

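import java.io.IOException;
import java.io.Reader;
import java.text.BreakIterator;
import java.util.Locale;

import com.lucene.analysis.Token;
import com.lucene.analysis.Tokenizer;

public class BreakIteratorTokenizer extends Tokenizer {
  private BreakIterator breakIterator;
  private String text;

  // Uses the given BreakIterator if it is non-null; otherwise falls back
  // to the default word iterator for the locale.
  public BreakIteratorTokenizer(Reader reader,
                                BreakIterator breakIterator,
                                Locale locale)
    throws IOException {
    // 'input' is the Reader field inherited from Tokenizer
    this.input = reader;

    // convert text from Reader to String
    StringBuffer stringBuffer = new StringBuffer();
    char[] chars = new char[1024];
    for (int i = reader.read(chars); i != -1; i = reader.read(chars)) {
      stringBuffer.append(chars, 0, i);
    }
    this.text = stringBuffer.toString();

    // get word iterator, then point it at the text
    this.breakIterator = (breakIterator != null)
      ? breakIterator
      : BreakIterator.getWordInstance(locale);
    this.breakIterator.setText(text);
  }

  // returns the text between the next two boundaries, or null at the end
  public Token next() throws IOException {
    int start = breakIterator.current();
    int end = breakIterator.next();
    if (end == BreakIterator.DONE)
      return null;
    else
      return new Token(text.substring(start, end), start, end);
  }
}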

Then you could define a ChineseAnalyzer class that constructed a
BreakIteratorTokenizer with the appropriate BreakIterator and Locale.  You
might also need to define and add a filter to remove between-word text,
which BreakIterator does not identify well.
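
To make that last point concrete: java.text.BreakIterator reports every
boundary, so the spans between consecutive boundaries include whitespace
and punctuation as well as words.  Below is a minimal sketch of dropping
those spans, using only the JDK.  The class name WordSpans and the rule
"keep a span only if it contains a letter or digit" are assumptions made
for illustration; a real filter for Chinese text may well need a
different rule.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text at word boundaries and keeps only the
// spans that look like words.
final class WordSpans {
  static List<String> words(String text, Locale locale) {
    BreakIterator it = BreakIterator.getWordInstance(locale);
    it.setText(text);

    List<String> result = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE;
         start = end, end = it.next()) {
      String span = text.substring(start, end);
      // Assumed rule: between-word text has no letters or digits.
      if (containsLetterOrDigit(span)) {
        result.add(span);
      }
    }
    return result;
  }

  private static boolean containsLetterOrDigit(String span) {
    for (int i = 0; i < span.length(); ) {
      int cp = span.codePointAt(i);
      if (Character.isLetterOrDigit(cp)) {
        return true;
      }
      i += Character.charCount(cp);
    }
    return false;
  }
}
```

For example, WordSpans.words("Hello, world!", Locale.ENGLISH) keeps
"Hello" and "world" and drops the comma, the space and the exclamation
mark.  In the Lucene design the same check would live in a TokenFilter
placed after the BreakIteratorTokenizer.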

Doug

_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-dev