another way to do this: pinyin4j

you can trans all Chinese words to pinyin form first, and index the
pinyin form as a field, then you can search on them

see: http://www.slideshare.net/tangfl/ss-2364878

in which we implement a pinyin search for our music search

2009/12/16 Weiwei Wang <ww.wang...@gmail.com>:
> Thanks Erick, I''ll take a carefull study of that
>
> 2009/12/16 Erick Erickson <erickerick...@gmail.com>
>
>> If your queries are still slow, make sure you're not measuring
>> the *first* query on a newly opened searcher. There are
>> other tips here that might be useful. These are general searching
>> tips complimentary to Robert's suggestions..
>>
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
>>
>> <http://wiki.apache.org/lucene-java/ImproveSearchingSpeed>HTH
>> Erick
>>
>> 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>>
>> > Thanks Robert, a lot is learned from you:-)
>> >
>> > On Wed, Dec 16, 2009 at 11:53 AM, Robert Muir <rcm...@gmail.com> wrote:
>> >
>> > > Hi, just one more thought for you.
>> > >
>> > > I think even more important than anything I said before, you should
>> > ensure
>> > > you implement reusableTokenStream in your analyzer.
>> > > this becomes a necessity if you are using expensive objects like this.
>> > >
>> > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > >
>> > > > Finally, i make it run, however, it works so slow
>> > > >
>> > > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > >
>> > > > > got it, thanks, Robert
>> > > > >
>> > > > >
>> > > > > On Tue, Dec 15, 2009 at 10:19 PM, Robert Muir <rcm...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > >>  if you have lucene 2.9 or 3.0 source code, just run patch -p0 <
>> > > > >> /path/to/LUCENE-XXYY.patch from the lucene source code root
>> > > directory...
>> > > > >> it
>> > > > >> should create the necessary directory and files.
>> > > > >> then run 'ant' , in this case it should create a lucene-icu jar
>> file
>> > > in
>> > > > >> the
>> > > > >> build directory.
>> > > > >>
>> > > > >> the patch doesnt include the icu dependency itself so you need to
>> > get
>> > > > that
>> > > > >> jar file from www.icu-project.org and have it in your classpath
>> > also
>> > > > >>
>> > > > >> sorry for the trouble, hope to integrate some of this soon for a
>> > > future
>> > > > >> release.
>> > > > >>
>> > > > >> On Tue, Dec 15, 2009 at 9:13 AM, Weiwei Wang <
>> ww.wang...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Yes, i found the patch file LUCENE-1488.patch and there's no icu
>> > > > >> directory
>> > > > >> > in my dowloaded contrib directory.
>> > > > >> >
>> > > > >> > I'm a rookie guy using patch, i'm currently in the contrib dir,
>> > > could
>> > > > >> > anybody tell me how to execute this patch command to generate
>> the
>> > > > >> relevant
>> > > > >> > dir and souce files?
>> > > > >> >
>> > > > >> > On Tue, Dec 15, 2009 at 9:51 PM, Robert Muir <rcm...@gmail.com>
>> > > > wrote:
>> > > > >> >
>> > > > >> > > look at the latest patch file attached to the issue, it should
>> > > work
>> > > > >> with
>> > > > >> > > lucene 2.9 or greater (I think)
>> > > > >> > >
>> > > > >> > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > > >> > >
>> > > > >> > > > where can i find the source code?
>> > > > >> > > >
>> > > > >> > > > On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <
>> > rcm...@gmail.com>
>> > > > >> wrote:
>> > > > >> > > >
>> > > > >> > > > > there is an icu transform tokenfilter in the patch here:
>> > > > >> > > > > http://issues.apache.org/jira/browse/LUCENE-1488
>> > > > >> > > > >
>> > > > >> > > > >    Transliterator pinyin =
>> > > > >> Transliterator.getInstance("Han-Latin");
>> > > > >> > > > >    Tokenizer tokenizer = new KeywordTokenizer(new
>> > > > >> > StringReader("中国"));
>> > > > >> > > > >    ICUTransformFilter filter = new
>> > > ICUTransformFilter(tokenizer,
>> > > > >> > > pinyin);
>> > > > >> > > > >    assertTokenStreamContents(filter, new String[] { "zhōng
>> > > guó"
>> > > > }
>> > > > >> );
>> > > > >> > > > >
>> > > > >> > > > > note it will add tone marks and insert space between
>> > syllables
>> > > > by
>> > > > >> > > default
>> > > > >> > > > > if you do not want this, you need to do some cleanup.
>> > > > >> > > > >
>> > > > >> > > > >    Transliterator pinyin =
>> > > > Transliterator.getInstance("Han-Latin;
>> > > > >> > NFD;
>> > > > >> > > > > [[:NonspacingMark:][:Space:]] Remove");
>> > > > >> > > > >    Tokenizer tokenizer = new KeywordTokenizer(new
>> > > > >> > StringReader("中国"));
>> > > > >> > > > >    ICUTransformFilter filter = new
>> > > ICUTransformFilter(tokenizer,
>> > > > >> > > pinyin);
>> > > > >> > > > >    assertTokenStreamContents(filter, new String[] {
>> > "zhongguo"
>> > > }
>> > > > >> );
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > > 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>> > > > >> > > > >
>> > > > >> > > > > > Hi, guys,
>> > > > >> > > > > >     I'm implementing a search engine based on Lucene for
>> > > > >> Chinese.
>> > > > >> > So
>> > > > >> > > I
>> > > > >> > > > > want
>> > > > >> > > > > > to support pinyin search as Google China do.
>> > > > >> > > > > >
>> > > > >> > > > > > e.g.
>> > > > >> > > > > >    “中国”  means Chinese in English
>> > > > >> > > > > >    this word's pinyin input is "zhongguo"
>> > > > >> > > > > > The feature i want to implement is when user type
>> zhongguo
>> > > the
>> > > > >> > > results
>> > > > >> > > > > will
>> > > > >> > > > > > include documents containing "中国" or even Chinese
>> > > > >> > > > > >
>> > > > >> > > > > > Anybody here know how to achieve this?
>> > > > >> > > > > >
>> > > > >> > > > > > --
>> > > > >> > > > > > Weiwei Wang
>> > > > >> > > > > > Alex Wang
>> > > > >> > > > > > 王巍巍
>> > > > >> > > > > > Room 403, Mengmin Wei Building
>> > > > >> > > > > > Computer Science Department
>> > > > >> > > > > > Gulou Campus of Nanjing University
>> > > > >> > > > > > Nanjing, P.R.China, 210093
>> > > > >> > > > > >
>> > > > >> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >> > > > > >
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > >
>> > > > >> > > > > --
>> > > > >> > > > > Robert Muir
>> > > > >> > > > > rcm...@gmail.com
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > >
>> > > > >> > > > --
>> > > > >> > > > Weiwei Wang
>> > > > >> > > > Alex Wang
>> > > > >> > > > 王巍巍
>> > > > >> > > > Room 403, Mengmin Wei Building
>> > > > >> > > > Computer Science Department
>> > > > >> > > > Gulou Campus of Nanjing University
>> > > > >> > > > Nanjing, P.R.China, 210093
>> > > > >> > > >
>> > > > >> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >> > > >
>> > > > >> > >
>> > > > >> > >
>> > > > >> > >
>> > > > >> > > --
>> > > > >> > > Robert Muir
>> > > > >> > > rcm...@gmail.com
>> > > > >> > >
>> > > > >> >
>> > > > >> >
>> > > > >> >
>> > > > >> > --
>> > > > >> > Weiwei Wang
>> > > > >> > Alex Wang
>> > > > >> > 王巍巍
>> > > > >> > Room 403, Mengmin Wei Building
>> > > > >> > Computer Science Department
>> > > > >> > Gulou Campus of Nanjing University
>> > > > >> > Nanjing, P.R.China, 210093
>> > > > >> >
>> > > > >> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >> >
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Robert Muir
>> > > > >> rcm...@gmail.com
>> > > > >>
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Weiwei Wang
>> > > > > Alex Wang
>> > > > > 王巍巍
>> > > > > Room 403, Mengmin Wei Building
>> > > > > Computer Science Department
>> > > > > Gulou Campus of Nanjing University
>> > > > > Nanjing, P.R.China, 210093
>> > > > >
>> > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Weiwei Wang
>> > > > Alex Wang
>> > > > 王巍巍
>> > > > Room 403, Mengmin Wei Building
>> > > > Computer Science Department
>> > > > Gulou Campus of Nanjing University
>> > > > Nanjing, P.R.China, 210093
>> > > >
>> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Robert Muir
>> > > rcm...@gmail.com
>> > >
>> >
>> >
>> >
>> > --
>> > Weiwei Wang
>> > Alex Wang
>> > 王巍巍
>> > Room 403, Mengmin Wei Building
>> > Computer Science Department
>> > Gulou Campus of Nanjing University
>> > Nanjing, P.R.China, 210093
>> >
>> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>> >
>>
>
>
>
> --
> Weiwei Wang
> Alex Wang
> 王巍巍
> Room 403, Mengmin Wei Building
> Computer Science Department
> Gulou Campus of Nanjing University
> Nanjing, P.R.China, 210093
>
> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>



-- 
梦的开始挣扎于城市的边缘
心的远方执着在脚步的瞬间
我的宿命埋藏了寂寞的永远

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Reply via email to