Hey, I'm not an expert on this, but I think you should look into CJKAnalyzer / CJKTokenizer.
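Something along these lines might be a starting point (a minimal, untested sketch, assuming Lucene 3.0.x with contrib/analyzers on the classpath; the constructor signature varies slightly across versions, and the field name and sample text are just placeholders). CJKTokenizer, which CJKAnalyzer uses internally, emits overlapping character bigrams for CJK runs, including Hangul:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class CJKDemo {
    public static void main(String[] args) throws Exception {
        // CJKAnalyzer bigrams CJK characters, so Hangul input comes out
        // as overlapping 2-grams at the character level.
        CJKAnalyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("f", new StringReader("루씬 분석기"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
        ts.close();
    }
}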
simon

On Thu, Feb 17, 2011 at 8:05 PM, CassUser CassUser <cassu...@gmail.com> wrote:
> Hey all,
>
> I'm somewhat new to Lucene, meaning I used it some time ago for a parser we
> wrote to tokenize a document into word grams.
>
> The approach I took was simple:
>
> 1. Extended the Lucene Analyzer.
> 2. In the tokenStream method, used ShingleMatrixFilter, passing in the
> standard tokenizer and the shingle min/max/spacer.
>
> This worked pretty well for us. Now we would like to tokenize Hangul/Korean
> into word grams.
>
> I'm curious whether others have done something similar and would share their
> experience. Any pointers to get started with this would be great.
>
> Thanks.
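For reference, the approach described in steps 1 and 2 above might look roughly like this sketch (assuming Lucene 3.0.x, where ShingleMatrixFilter still lives in contrib/analyzers; the class name WordGramAnalyzer, the 2/3 shingle sizes, and the '_' spacer character are illustrative choices, not details from the original mail):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer along the lines described above: a StandardTokenizer
// feeding a ShingleMatrixFilter that emits multi-word grams.
public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream source = new StandardTokenizer(Version.LUCENE_30, reader);
        // Emit 2- and 3-word shingles, joining words with '_' as the spacer.
        return new ShingleMatrixFilter(source, 2, 3, '_');
    }
}

Note the difference in granularity: StandardTokenizer keeps whitespace-delimited Hangul words intact (as far as I know), so these shingles are word grams, whereas CJKTokenizer emits overlapping character bigrams.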