Another way to do this is pinyin4j: first convert all Chinese words to their pinyin form and index the pinyin form as a separate field; then you can search on that field.
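A minimal sketch of the "index the pinyin form as a field" idea, not our actual implementation. The tiny character table is a hypothetical stand-in for the lookup that pinyin4j would provide, and the in-memory map stands in for a Lucene field; the class and method names are made up for illustration. Polyphonic characters (more than one reading) are ignored here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PinyinFieldSketch {
    // Hypothetical stand-in for a pinyin4j lookup; a real table covers all Han characters.
    private static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('中', "zhong");
        PINYIN.put('国', "guo");
        PINYIN.put('人', "ren");
    }

    // Stand-in for an indexed "pinyin" field: pinyin form -> ids of matching documents.
    private final Map<String, List<Integer>> pinyinField = new HashMap<>();

    // Convert text to toneless pinyin with no spaces; unknown characters are lower-cased and kept.
    static String toPinyin(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            String p = PINYIN.get(c);
            if (p != null) {
                sb.append(p);
            } else {
                sb.append(Character.toLowerCase(c));
            }
        }
        return sb.toString();
    }

    // Index time: store the document id under the pinyin form of its text.
    void addDocument(int id, String text) {
        pinyinField.computeIfAbsent(toPinyin(text), k -> new ArrayList<>()).add(id);
    }

    // Query time: normalize what the user typed and look it up in the pinyin field.
    List<Integer> search(String typed) {
        String key = typed.toLowerCase(Locale.ROOT);
        return pinyinField.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        PinyinFieldSketch index = new PinyinFieldSketch();
        index.addDocument(1, "中国");
        index.addDocument(2, "中国人");
        System.out.println(index.search("zhongguo"));
    }
}
```

In a real index you would keep the original Chinese text in its own field and add the pinyin form as a second field on the same document, so both kinds of query hit the same documents.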
See http://www.slideshare.net/tangfl/ss-2364878, which describes the pinyin search we implemented for our music search.

2009/12/16 Weiwei Wang <ww.wang...@gmail.com>:
> Thanks Erick, I'll take a careful study of that
>
> 2009/12/16 Erick Erickson <erickerick...@gmail.com>
>
>> If your queries are still slow, make sure you're not measuring
>> the *first* query on a newly opened searcher. There are
>> other tips here that might be useful. These are general searching
>> tips complementary to Robert's suggestions.
>>
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>
>> HTH
>> Erick
>>
>> 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>>
>> > Thanks Robert, a lot is learned from you :-)
>> >
>> > On Wed, Dec 16, 2009 at 11:53 AM, Robert Muir <rcm...@gmail.com> wrote:
>> >
>> > > Hi, just one more thought for you.
>> > >
>> > > I think even more important than anything I said before, you should
>> > > ensure you implement reusableTokenStream in your analyzer.
>> > > this becomes a necessity if you are using expensive objects like this.
>> > >
>> > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > >
>> > > > Finally, i make it run, however, it works so slow
>> > > >
>> > > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > >
>> > > > > got it, thanks, Robert
>> > > > >
>> > > > > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > > > >
>> > > > >> if you have lucene 2.9 or 3.0 source code, just run
>> > > > >> patch -p0 < /path/to/LUCENE-XXYY.patch
>> > > > >> from the lucene source code root directory... it should create
>> > > > >> the necessary directory and files.
>> > > > >> then run 'ant', in this case it should create a lucene-icu jar
>> > > > >> file in the build directory.
>> > > > >>
>> > > > >> the patch doesn't include the icu dependency itself so you need
>> > > > >> to get that jar file from www.icu-project.org and have it in
>> > > > >> your classpath also
>> > > > >>
>> > > > >> sorry for the trouble, hope to integrate some of this soon for a
>> > > > >> future release.
>> > > > >>
>> > > > >> On Tue, Dec 15, 2009 at 9:13 AM, Weiwei Wang <ww.wang...@gmail.com> wrote:
>> > > > >>
>> > > > >> > Yes, i found the patch file LUCENE-1488.patch and there's no
>> > > > >> > icu directory in my downloaded contrib directory.
>> > > > >> >
>> > > > >> > I'm a rookie guy using patch, i'm currently in the contrib dir,
>> > > > >> > could anybody tell me how to execute this patch command to
>> > > > >> > generate the relevant dir and source files?
>> > > > >> >
>> > > > >> > On Tue, Dec 15, 2009 at 9:51 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > > > >> >
>> > > > >> > > look at the latest patch file attached to the issue, it
>> > > > >> > > should work with lucene 2.9 or greater (I think)
>> > > > >> > >
>> > > > >> > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > > >> > >
>> > > > >> > > > where can i find the source code?
>> > > > >> > > >
>> > > > >> > > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rcm...@gmail.com> wrote:
>> > > > >> > > >
>> > > > >> > > > > there is an icu transform tokenfilter in the patch here:
>> > > > >> > > > > http://issues.apache.org/jira/browse/LUCENE-1488
>> > > > >> > > > >
>> > > > >> > > > > Transliterator pinyin = Transliterator.getInstance("Han-Latin");
>> > > > >> > > > > Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
>> > > > >> > > > > ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
>> > > > >> > > > > assertTokenStreamContents(filter, new String[] { "zhōng guó" });
>> > > > >> > > > >
>> > > > >> > > > > note it will add tone marks and insert space between
>> > > > >> > > > > syllables by default
>> > > > >> > > > > if you do not want this, you need to do some cleanup.
>> > > > >> > > > >
>> > > > >> > > > > Transliterator pinyin = Transliterator.getInstance("Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
>> > > > >> > > > > Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
>> > > > >> > > > > ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
>> > > > >> > > > > assertTokenStreamContents(filter, new String[] { "zhongguo" });
>> > > > >> > > > >
>> > > > >> > > > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > > >> > > > >
>> > > > >> > > > > > Hi, guys,
>> > > > >> > > > > > I'm implementing a search engine based on Lucene for
>> > > > >> > > > > > Chinese. So I want to support pinyin search as Google
>> > > > >> > > > > > China does.
>> > > > >> > > > > >
>> > > > >> > > > > > e.g.
>> > > > >> > > > > > “中国” means "China" in English
>> > > > >> > > > > > this word's pinyin input is "zhongguo"
>> > > > >> > > > > > The feature i want to implement is: when the user types
>> > > > >> > > > > > zhongguo, the results will include documents containing
>> > > > >> > > > > > "中国" or even "China"
>> > > > >> > > > > >
>> > > > >> > > > > > Anybody here know how to achieve this?
>> > > > >> > > > > >
>> > > > >> > > > > > --
>> > > > >> > > > > > Weiwei Wang
>> > > > >> > > > > > Alex Wang
>> > > > >> > > > > > 王巍巍
>> > > > >> > > > > > Room 403, Mengmin Wei Building
>> > > > >> > > > > > Computer Science Department
>> > > > >> > > > > > Gulou Campus of Nanjing University
>> > > > >> > > > > > Nanjing, P.R.China, 210093
>> > > > >> > > > > >
>> > > > >> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >> > > > >
>> > > > >> > > > > --
>> > > > >> > > > > Robert Muir
>> > > > >> > > > > rcm...@gmail.com

--
The dream begins struggling at the edge of the city
The heart's far horizon holds fast in the instant of a footstep
My fate has buried an eternity of loneliness

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org