There is an ICU transform token filter in the patch here: http://issues.apache.org/jira/browse/LUCENE-1488

For example:

    // Transliterate Han characters to Latin (pinyin) with ICU.
    Transliterator pinyin = Transliterator.getInstance("Han-Latin");
    // KeywordTokenizer emits the whole input as a single token.
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhōng guó" });

Note that by default it adds tone marks and inserts a space between
syllables. If you do not want this, you need to do some cleanup:

    // Decompose (NFD), then remove the combining tone marks and the spaces.
    Transliterator pinyin = Transliterator.getInstance(
        "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
    Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
    ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
    assertTokenStreamContents(filter, new String[] { "zhongguo" });

2009/12/15 Weiwei Wang <ww.wang...@gmail.com>

> Hi, guys,
>     I'm implementing a search engine based on Lucene for Chinese. So I want
> to support pinyin search as Google China does.
>
> e.g.
> "中国" means China in English
> this word's pinyin input is "zhongguo"
> The feature I want to implement is: when a user types zhongguo, the results
> will include documents containing "中国" or even "China"
>
> Does anybody here know how to achieve this?
>
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
>
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>

--
Robert Muir
rcm...@gmail.com
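
To make the cleanup step above more concrete, here is a rough JDK-only sketch of what the "NFD; [[:NonspacingMark:][:Space:]] Remove" part of the rule does to text that is already pinyin. It does not use ICU or the patch and does no Han-to-Latin conversion; the class and method names (PinyinCleanup, stripTonesAndSpaces) are made up for illustration, and the regex is only an approximation of the ICU set.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or ICU.
class PinyinCleanup {
    // \p{Mn} = nonspacing marks (the combining tone marks after NFD),
    // \p{IsWhite_Space} = Unicode whitespace.
    private static final Pattern MARKS_AND_SPACES =
        Pattern.compile("[\\p{Mn}\\p{IsWhite_Space}]+");

    static String stripTonesAndSpaces(String pinyin) {
        // NFD splits a letter with a tone mark into base letter + combining mark.
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        // Drop the combining marks and the spaces between syllables.
        return MARKS_AND_SPACES.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "zh\u014Dng gu\u00F3" is the toned pinyin with a space between syllables.
        System.out.println(stripTonesAndSpaces("zh\u014Dng gu\u00F3")); // prints zhongguo
    }
}
```

Doing this inside the transliterator rule, as in the second snippet above, keeps everything in one filter; a separate step like this one only makes sense if the toned form is also needed elsewhere.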