Hey, I'm not an expert on this, but I think you should look into CJKAnalyzer / CJKTokenizer.
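Something along these lines might be a starting point (a minimal, untested sketch, assuming Lucene 3.0.x with contrib/analyzers on the classpath; the constructor signature varies slightly across versions, and the field name and sample text are just placeholders). CJKTokenizer, which CJKAnalyzer uses internally, emits overlapping character bigrams for CJK runs, including Hangul:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class CJKDemo {
    public static void main(String[] args) throws Exception {
        // CJKAnalyzer bigrams CJK characters, so Hangul input comes out
        // as overlapping 2-grams at the character level.
        CJKAnalyzer analyzer = new CJKAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("f", new StringReader("루씬 분석기"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.term());
        }
        ts.close();
    }
}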
simon

On Thu, Feb 17, 2011 at 8:05 PM, CassUser CassUser <cassu...@gmail.com> wrote:
> Hey all,
>
> I'm somewhat new to Lucene, meaning I used it some time ago for a parser we
> wrote to tokenize a document into word grams.
>
> The approach I took was simple:
>
> 1. Extended the Lucene Analyzer.
> 2. In the tokenStream method, used ShingleMatrixFilter, passing in the
> standard tokenizer and the shingle min/max/spacer.
>
> This worked pretty well for us. Now we would like to tokenize Hangul/Korean
> into word grams.
>
> I'm curious whether others have done something similar and would share their
> experience. Any pointers to get started with this would be great.
>
> Thanks.
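For reference, the approach described in steps 1 and 2 above might look roughly like this sketch (assuming Lucene 3.0.x, where ShingleMatrixFilter still lives in contrib/analyzers; the class name WordGramAnalyzer, the 2/3 shingle sizes, and the '_' spacer character are illustrative choices, not details from the original mail):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer along the lines described above: a StandardTokenizer
// feeding a ShingleMatrixFilter that emits multi-word grams.
public class WordGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream source = new StandardTokenizer(Version.LUCENE_30, reader);
        // Emit 2- and 3-word shingles, joining words with '_' as the spacer.
        return new ShingleMatrixFilter(source, 2, 3, '_');
    }
}

Note the difference in granularity: StandardTokenizer keeps whitespace-delimited Hangul words intact (as far as I know), so these shingles are word grams, whereas CJKTokenizer emits overlapping character bigrams.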