To expand on Herb's comment: in Lucene, the StandardAnalyzer will break CJK 
text such as "轻歌曼舞庆元旦" into individual characters (a short sketch showing 
this follows the list): 

1 : 轻
2 : 歌
3 : 曼
4 : 舞
5 : 庆
6 : 元
7 : 旦
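
For reference, here is one way to see those tokens by running the analyzer 
directly.  This is a minimal, untested sketch against the Lucene 4.x analysis 
API; the field name "data" and the Version constant are just placeholders for 
whatever you actually use:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class ShowStandardTokens {
        public static void main(String[] args) throws Exception {
            // StandardAnalyzer emits each Han character as its own token.
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
            TokenStream ts = analyzer.tokenStream("data", new StringReader("轻歌曼舞庆元旦"));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());   // 轻, 歌, 曼, 舞, 庆, 元, 旦
            }
            ts.end();
            ts.close();
        }
    }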

If you initialize the classic QueryParser with StandardAnalyzer, the parser 
will use that Analyzer to break the string into individual characters, as 
above.  From a linguistic standpoint this is unnerving, but from a retrieval 
perspective it should work fairly well, as long as you are also doing some 
kind of normalization (e.g. ICU or CJKWidthFilter).  As Herb mentioned, you 
might consider experimenting with smartcn to tokenize on actual words; as an 
example, the SmartChineseAnalyzer breaks the string into the tokens below 
(a QueryParser sketch follows the list):

1 : 轻歌曼舞
2 : 庆
3 : 元旦
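
Wiring that into the classic QueryParser is just a matter of which Analyzer 
you hand it.  Again, a rough sketch only: the field name "data" and the 
Version constant are placeholders, and the printed form depends on your 
default operator:

    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class SmartcnQueryExample {
        public static void main(String[] args) throws Exception {
            // The parser delegates tokenization to whatever Analyzer it is given,
            // so swapping StandardAnalyzer for SmartChineseAnalyzer changes the split.
            QueryParser parser = new QueryParser(Version.LUCENE_40, "data",
                    new SmartChineseAnalyzer(Version.LUCENE_40));
            Query query = parser.parse("轻歌曼舞庆元旦");
            System.out.println(query);   // e.g. data:轻歌曼舞 data:庆 data:元旦
        }
    }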

In Solr, if you use the default "text_cjk" field type, you'll get this bigram 
behavior because of CJKBigramFilterFactory.  If you don't want bigrams, 
consider removing that filter; or, if you want both bigrams and unigrams, add 
outputUnigrams="true", as in the line below (a Lucene-level sketch follows it):

<filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
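
At the Lucene level, the rough equivalent of that chain would be something 
like the untested sketch below; the Version constant is a placeholder and the 
flags shown are just the filter's usual all-scripts defaults spelled out, so 
treat it as an illustration rather than a drop-in copy of text_cjk:

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.cjk.CJKWidthFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class BigramsPlusUnigrams {
        public static void main(String[] args) throws Exception {
            // Standard tokenization, width normalization, then bigramming with
            // unigrams also emitted (the analogue of outputUnigrams="true").
            TokenStream ts = new StandardTokenizer(Version.LUCENE_40,
                    new StringReader("轻歌曼舞庆元旦"));
            ts = new CJKWidthFilter(ts);
            ts = new CJKBigramFilter(ts,
                    CJKBigramFilter.HAN | CJKBigramFilter.HIRAGANA
                            | CJKBigramFilter.KATAKANA | CJKBigramFilter.HANGUL,
                    true);  // true = also output unigrams
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());  // unigrams and bigrams interleaved
            }
            ts.end();
            ts.close();
        }
    }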


-----Original Message-----
From: Herb Roitblat [mailto:herb.roitb...@orcatec.com] 
Sent: Monday, March 24, 2014 9:01 AM
To: java-user@lucene.apache.org; kalaiselva...@zohocorp.com
Subject: Re: QueryParser

The default query parser for CJK languages breaks text into bigrams.  A 
word consisting of characters ABCDE is broken into tokens  AB, BC, CD, 
DE, or

"轻歌曼舞庆元旦"

into
data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦

Each pair may or may not be a word, but if you use the same parser (i.e. 
analyzer) for indexing and for searching, you should get reasonable 
results.  A more powerful analyzer, typically one that includes a 
dictionary (such as smartcn, linked below), is available and may give 
analyses closer to what a reader would expect, at the cost of being slower.

Look here, for example: 
http://lucene.apache.org/core/4_0_0/analyzers-common/index.html
and here: http://lucene.apache.org/core/4_0_0/analyzers-smartcn/index.html



On 3/23/2014 11:21 PM, kalaik wrote:
> Dear Team,
>
>                  Any Update ?
>
> ---- On Fri, 21 Mar 2014 14:40:51 +0530 kalaik 
> <kalaiselva...@zohocorp.com> wrote ----
>
>
>
>
> Dear Team,
>
>                  We are using Lucene in our product; it searches with high 
> speed and good performance, but
>
>                  Japanese, Chinese, and Korean text is not being searched 
> properly.  We are using QueryParser.
>
>                  QueryParser splits a word like "轻歌曼舞庆元旦"
>
>                   Example
>
>                              This word "轻歌曼舞庆元旦"
>
>                             is split into:  data:轻歌 data:歌曼 data:曼舞 data:舞庆 
> data:庆元 data:元旦
>
> Here is my code:
>
>                              Query query = parser.parse(searchData);
>                              logger.log(Level.INFO, "Search Query is calling {0}", query);
>                              TopDocs docs = is.search(query, resultRowSize);
>
>
> In case you need any clarification, please get back to me.  Please help as 
> soon as possible.
>
>
> Regards,
> kalai..
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
