To expand on Herb's comment, in Lucene, the StandardAnalyzer will break CJK into characters:
1 : 轻
2 : 歌
3 : 曼
4 : 舞
5 : 庆
6 : 元
7 : 旦

If you initialize the classic QueryParser with StandardAnalyzer, the parser will use that Analyzer to break this string into individual characters, as above. From a linguistic standpoint this is unnerving, but from a retrieval perspective it should work fairly well, as long as you are also doing some kind of normalization (ICU or CJKWidthFilter).

As Herb mentioned, you might consider experimenting with smartcn to try to tokenize on actual words; as an example, the SmartChineseAnalyzer breaks the string into:

1 : 轻歌曼舞
2 : 庆
3 : 元旦

In Solr, if you use the default "text_cjk" field type, you'll get the bigram behavior because of CJKBigramFilterFactory. If you don't want bigram behavior, consider removing that filter; or if you want both bigrams and unigrams, consider adding outputUnigrams="true", as in:

<filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>

-----Original Message-----
From: Herb Roitblat [mailto:herb.roitb...@orcatec.com]
Sent: Monday, March 24, 2014 9:01 AM
To: java-user@lucene.apache.org; kalaiselva...@zohocorp.com
Subject: Re: QueryParser

The default query parser for CJK languages breaks text into bigrams. A word consisting of characters ABCDE is broken into tokens AB, BC, CD, DE; "轻歌曼舞庆元旦" becomes:

data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦

Each pair may or may not be a word, but if you use the same parser (i.e. analyzer) for indexing and for searching, you should get reasonable results. A more powerful parser, typically one that includes a dictionary, is available, and may give more expected analyses at the cost of being slower. Look here, for example:

http://lucene.apache.org/core/4_0_0/analyzers-common/index.html

and here:

http://lucene.apache.org/core/4_0_0/analyzers-smartcn/index.html

On 3/23/2014 11:21 PM, kalaik wrote:
> Dear Team,
>
> Any update?
> ---- On Fri, 21 Mar 2014 14:40:51 +0530 kalaik <kalaiselva...@zohocorp.com> wrote ----
>
> Dear Team,
>
> We are using Lucene in our product. It searches with high speed and good performance, but Japanese, Chinese, and Korean text is not searched properly. We have used QueryParser, and it splits a word like "轻歌曼舞庆元旦".
>
> Example: the word "轻歌曼舞庆元旦" is split into:
>
> data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
>
> Here is my code:
>
> Query query = parser.parse(searchData);
> logger.log(Level.INFO, "Search Query is calling {0}", query);
> TopDocs docs = is.search(query, resultRowSize);
>
> In case of any clarification, please get back to me. Please help as soon as possible.
>
> Regards,
> kalai..

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
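The bigram scheme described in this thread (ABCDE → AB, BC, CD, DE) can be sketched in a few lines of plain Java. This is an illustration only, not Lucene's actual CJKBigramFilter code: the real filter operates on a TokenStream, restricts itself to Han/Hiragana/Katakana/Hangul scripts, and tracks offsets and position increments, none of which this sketch attempts.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of CJK bigramming, not Lucene's implementation:
// each adjacent pair of characters becomes one token, so a run of N
// characters yields N-1 overlapping bigrams.
public class BigramSketch {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        // Stop at the second-to-last character so substring(i, i + 2)
        // always has two characters; a 0- or 1-char input yields no tokens.
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [轻歌, 歌曼, 曼舞, 舞庆, 庆元, 元旦]
        System.out.println(bigrams("轻歌曼舞庆元旦"));
    }
}
```

Because the same overlapping pairs are produced at index time and at query time, a query for any word that appears in a document shares bigrams with it, which is why Herb notes that using the same analyzer on both sides gives reasonable results even though individual pairs may not be real words.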