Hi,

If you don't want to configure the tokenizer and filters yourself, you can
use lucene-analyzers-smartcn; its default behavior will do most of the
trick.
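For example, here's a minimal sketch of wiring smartcn into the classic
QueryParser (assuming lucene-core, lucene-queryparser, and
lucene-analyzers-smartcn are on the classpath; in 4.x the constructors also
take a Version argument, and the "data" field name just follows the examples
in this thread):

```java
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class SmartcnQueryExample {
    // Builds a query over the "data" field using smartcn's
    // dictionary-based, word-level Chinese tokenization.
    static Query buildQuery(String text) throws Exception {
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        QueryParser parser = new QueryParser("data", analyzer);
        return parser.parse(text);
    }

    public static void main(String[] args) throws Exception {
        // smartcn tokenizes on dictionary words rather than
        // single characters or bigrams
        System.out.println(buildQuery("轻歌曼舞庆元旦"));
    }
}
```

Remember to use the same analyzer at index time, or the query terms won't
match what's in the index.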

All the best

Liu Bo


On 25 March 2014 00:24, Allison, Timothy B. <talli...@mitre.org> wrote:

> To expand on Herb's comment, in Lucene, the StandardAnalyzer will break
> CJK into characters:
>
> 1 : 轻
> 2 : 歌
> 3 : 曼
> 4 : 舞
> 5 : 庆
> 6 : 元
> 7 : 旦
>
> If you initialize the classic QueryParser with StandardAnalyzer, the
> parser will use that Analyzer to break this string into individual
> characters as above.  From a linguistic standpoint, this is unnerving, but
> from a retrieval perspective, this should work fairly well as long as you
> are also doing some kind of normalization (ICU or CJKWidthFilter).  As Herb
> mentioned, you might consider experimenting with smartcn to try to tokenize
> on actual words; as an example, the SmartChineseAnalyzer breaks the string
> into:
>
> 1 : 轻歌曼舞
> 2 : 庆
> 3 : 元旦
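If you want to see for yourself what a given Analyzer emits, a small sketch
along these lines works (assumes lucene-core and lucene-analyzers-smartcn on
the classpath; the no-arg SmartChineseAnalyzer constructor is from the 5.x
line, 4.x versions take a Version argument):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PrintTokens {
    // Collects the tokens an Analyzer emits for the given text,
    // following the standard reset/incrementToken/end/close cycle.
    static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("data", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        int i = 1;
        for (String t : tokens(new SmartChineseAnalyzer(), "轻歌曼舞庆元旦")) {
            System.out.println((i++) + " : " + t);
        }
    }
}
```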
>
> In Solr, if you use the default "text_cjk", you'll get this bigram
> behavior because of CJKBigramFilterFactory.  If you don't want bigram
> behavior, consider removing that filter; or if you want both bigrams and
> unigrams, consider adding: outputUnigrams="true" as in:
>
> <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
>
>
> -----Original Message-----
> From: Herb Roitblat [mailto:herb.roitb...@orcatec.com]
> Sent: Monday, March 24, 2014 9:01 AM
> To: java-user@lucene.apache.org; kalaiselva...@zohocorp.com
> Subject: Re: QueryParser
>
> The default query parser for CJK languages breaks text into bigrams.  A
> word consisting of characters ABCDE is broken into tokens  AB, BC, CD,
> DE, or
>
> "轻歌曼舞庆元旦"
>
> into
> data:轻歌 data:歌曼 data:曼舞 data:舞庆 data:庆元 data:元旦
>
> Each pair may or may not be a word, but if you use the same parser (i.e.
> analyzer) for indexing and for searching, you should get reasonable
> results.  A more powerful parser, typically one that includes a
> dictionary, is available, and may give more expected analyses at the
> cost of being slower.
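The bigram idea itself is simple enough to sketch without Lucene: slide a
window of two characters across the string, emitting each adjacent pair as a
token (this is only an illustration of the concept; the real
CJKBigramFilter also handles script detection, unigram output, and so on):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {
    // Returns the overlapping character bigrams of the input text.
    // codePoints() handles characters outside the BMP correctly.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            tokens.add(new String(cps, i, 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [轻歌, 歌曼, 曼舞, 舞庆, 庆元, 元旦]
        System.out.println(bigrams("轻歌曼舞庆元旦"));
    }
}
```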
>
> Look here, for example:
> http://lucene.apache.org/core/4_0_0/analyzers-common/index.html
> and here: http://lucene.apache.org/core/4_0_0/analyzers-smartcn/index.html
>
>
>
> On 3/23/2014 11:21 PM, kalaik wrote:
> > Dear Team,
> >
> >                  Any update?
> >
> > ---- On Fri, 21 Mar 2014 14:40:51 +0530 kalaik
> > <kalaiselva...@zohocorp.com> wrote ----
> >
> >
> >
> >
> > Dear Team,
> >
> >                  we are using Lucene in our product; it searches with
> high speed and good performance, but
> >
> >
> >                  Japanese, Chinese, and Korean text is not searched
> properly. We have used QueryParser
> >
> >
> >                  QueryParser splits words like "轻歌曼舞庆元旦"
> >
> >
> >                   Example
> >
> >                              This word "轻歌曼舞庆元旦"
> >
> >                             split tokens:  data:轻歌 data:歌曼 data:曼舞
> data:舞庆 data:庆元 data:元旦
> >
> > here is my code
> >
> >                              Query query =  parser.parse(searchData);
> >
> >                               logger.log(Level.INFO,"Search Query is
> calling {0}",query);
> >
> >                               TopDocs docs = is.search(query,
> resultRowSize);
> >
> >
> > In case you need any clarification, please get back to me. Please help
> as soon as possible.
> >
> >
> > Regards,
> > kalai..
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

