Hi,

If you don't want to configure the tokenizer and filters yourself, you can use lucene-analyzers-smartcn; its default behavior will do most of the trick.
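As a minimal sketch of trying that out, the helper below dumps the tokens an analyzer produces (the class and field names are hypothetical; it assumes lucene-analyzers-smartcn is on the classpath and uses the no-arg SmartChineseAnalyzer constructor of newer Lucene versions — on 4.x you would pass a Version constant instead):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SmartcnDemo {

    // Collect the tokens an Analyzer produces for the given text,
    // following the standard TokenStream contract:
    // reset() -> incrementToken() loop -> end() -> close().
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("data", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        try (Analyzer smartcn = new SmartChineseAnalyzer()) {
            System.out.println(tokenize(smartcn, "轻歌曼舞庆元旦"));
        }
    }
}
```

Running the same helper with StandardAnalyzer makes it easy to compare the two tokenizations side by side.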
All the best
Liu Bo

On 25 March 2014 00:24, Allison, Timothy B. <talli...@mitre.org> wrote:

> To expand on Herb's comment, in Lucene, the StandardAnalyzer will break
> CJK into characters:
>
> 1 : 轻
> 2 : 歌
> 3 : 曼
> 4 : 舞
> 5 : 庆
> 6 : 元
> 7 : 旦
>
> If you initialize the classic QueryParser with StandardAnalyzer, the
> parser will use that Analyzer to break this string into individual
> characters as above. From a linguistic standpoint, this is unnerving, but
> from a retrieval perspective, this should work fairly well as long as you
> are also doing some kind of normalization (ICU or CJKWidthFilter). As Herb
> mentioned, you might consider experimenting with smartcn to try to tokenize
> on actual words; as an example, the SmartChineseAnalyzer breaks the string
> into:
>
> 1 : 轻歌曼舞
> 2 : 庆
> 3 : 元旦
>
> In Solr, if you use the default "text_cjk", you'll get this bigram
> behavior because of CJKBigramFilterFactory. If you don't want bigram
> behavior, consider removing that filter; or if you want both bigrams and
> unigrams, consider adding outputUnigrams="true" as in:
>
> <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
>
> -----Original Message-----
> From: Herb Roitblat [mailto:herb.roitb...@orcatec.com]
> Sent: Monday, March 24, 2014 9:01 AM
> To: java-user@lucene.apache.org; kalaiselva...@zohocorp.com
> Subject: Re: QueryParser
>
> The default query parser for CJK languages breaks text into bigrams. A
> word consisting of characters ABCDE is broken into tokens AB, BC, CD,
> DE, or
>
> "轻歌曼舞庆元旦"
>
> into
>
> data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
>
> Each pair may or may not be a word, but if you use the same parser (i.e.
> analyzer) for indexing and for searching, you should get reasonable
> results. A more powerful parser, typically one that includes a
> dictionary, is available, and may give more expected analyses at the
> cost of being slower.
>
> Look here, for example:
> http://lucene.apache.org/core/4_0_0/analyzers-common/index.html
> and here:
> http://lucene.apache.org/core/4_0_0/analyzers-smartcn/index.html
>
> On 3/23/2014 11:21 PM, kalaik wrote:
> > Dear Team,
> >
> > Any update?
> >
> > ---- On Fri, 21 Mar 2014 14:40:51 +0530 kalaik <kalaiselva...@zohocorp.com> wrote ----
> >
> > Dear Team,
> >
> > We are using Lucene in our product. It searches with high speed and good
> > performance, but Japanese, Chinese and Korean text is not searched
> > properly. We have used QueryParser, and QueryParser splits a word like
> > "轻歌曼舞庆元旦" into pairs.
> >
> > Example: the word "轻歌曼舞庆元旦" is split into:
> > data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
> >
> > Here is my code:
> >
> > Query query = parser.parse(searchData);
> > logger.log(Level.INFO, "Search Query is calling {0}", query);
> > TopDocs docs = is.search(query, resultRowSize);
> >
> > In case of any clarification, please get back to me. Please help as soon
> > as possible.
> >
> > Regards,
> > kalai..
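Herb's "ABCDE -> AB, BC, CD, DE" description above can be illustrated in plain Java. This is only a toy sketch of the bigramming idea (it assumes the characters are in the Basic Multilingual Plane, as in the example string), not Lucene's actual CJKBigramFilter:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    // Produce overlapping character bigrams, as the CJK bigram approach
    // does conceptually: "ABCDE" -> AB, BC, CD, DE.
    // char-based indexing is fine here because the demo characters are
    // all in the BMP (one char each).
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("轻歌曼舞庆元旦"));
        // [轻歌, 歌曼, 曼舞, 舞庆, 庆元, 元旦]
    }
}
```

The seven-character string yields six overlapping pairs, matching the `data:轻歌 ... data:元旦` tokens shown in the thread; a single character yields no bigrams at all, which is why Tim suggests `outputUnigrams="true"` when single-character matches also matter.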