Re: Is there bug in CJKAnalyzer?

Ivan Vasilev Tue, 23 Oct 2007 05:09:18 -0700

Thanks Samir :)

You info was really helpful for us. I saw the index by Luke and there
the Chinese signs were split in pairs as you said �C AB BC CD etc. Also
when querying for ABC it is split in the query to AB BC.
But how to understand the meaning of this:
“To overcome this, you have to index chinese characters as single tokens
(this will increase recall, but decrease precision).”

I understand it so: To increase the results I have to use instead of the 
Chinese another analyzer that makes tokenization of the text character by 
character. 

Do you know if the range searches  work correctly with CJK texts when during 
the indexing was used CJKAnalyzer? My opinion is that as the terms are sorted 
by using String.copmareTo method but not Collator.compare then the range 
searches will as close to correct as the second method is close to the first 
one.
Interesting is when passing more than 2 Chinese characters �C then the Lucene 
query is not split to bigram sub queries (as in the normal queries) and may be 
some incorrect results can come because of this.

Example: there is indexed the text: AAB. The index will contain: AA AB.
User range search is: content:[AG TO PQ]
Here may be B will mean some word, and user will expect it in the results (as B 
is after AG and before PQ), but as it is not single token but in pair AB which 
is before AG, it will not be returned.
Here may be I guess your answer �C to prevent this use per character analyzer 
:) 

Thanks once again Samir for your explanation of how CJKAnalyzer works it was 
really calming for me to know that my CJKAnalyzer work as it is expected. 

Best Regards,
Ivan

Samir Abdou wrote:
> Hi,
>
> For a chinese token like ABCD (where A,B,C and D are chinese signs),
> CJKAnalyzer will generate the following overlapping bigrams: AB  BC  CD.
> Thus issuing a query containing one chinese sign will not retrieve any
> documents.  To overcome this, you have to index chinese characters as single
> tokens (this will increase recall, but decrease precision).
>
> Hope this will help,
> Samir
>
>
>
> 2007/10/22, Ivan Vasilev <[EMAIL PROTECTED]>:
>   
>> Hi Guys,
>>
>> I have made tests with the CJKAnalyzer and the results show something
>> that seems very strange to me. First I have to say that I do not
>> understand non of the CJK languages.
>> What I do is the following I write some text in English and translate it
>> using an on-line tool, which give me the translated script per word or
>> per group of words. The translated text I put in separate files and
>> index them using proper encoding for readers.
>> What is strange is that when searching just one hieroglyph (no matter if
>> it is separate word in the text or part of a word) Lucene almost never
>> finds result (may be only in less than 5% find results for word like �C
>> that=那, commas and so).
>> I also copy/pasted text from Chinese Academy of Science web site to
>> ignore results in case the translation toll does not work correctly. The
>> result is the same.
>> But when searching for two or more consequent hieroglyphs everything is
>> OK if they persist in the text they are found.
>>
>> So my question is: Is this normal behavior for CJKAnalyzer �C not to find
>> results when only one hieroglyph is searched or there is some bug with
>> that Analyzer?
>>
>> I also would like to say that I reindexed with a very simple class (not
>> with our searching engine) to ignore any possible mistakes. The results
>> are the same.
>>
>> I will give the example of the text that I use:
>>
>> English:
>>
>> The quick brown fox jumped over the lazy dog.
>>
>> Chinese:
>>
>> 灵布朗狐逾懒狗。
>>
>> English word by word:
>>
>> |NA The |1 quick |2 brown |3 fox |4 jumped over |NA the |5 lazy |6 dog |7.
>>
>> Responding Chinese words:
>>
>> |1 灵 |2 布朗 |3 狐 |4 逾 |5 懒 | 6 狗 |7。
>>
>> NOTE: My files contain only the Chinese text.
>>
>> Best Regards,
>> Ivan
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>     

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Is there bug in CJKAnalyzer?

Reply via email to