Hello,

I see differences in CJK languages (Chinese, Japanese, Korean). Note that 
segmentation (aka tokenization) for these languages is a very complex task, 
because they do not use spaces to separate words. There are some techniques to 
work around this, e.g. creating bigrams, and there also exist NLP-based 
segmentation libraries (e.g. Stanford has one). I think bigrams should be 
generated by the CLucene standard analyzer, but I've never tried that.
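
For illustration, here is a minimal sketch of the character-bigram approach 
over a wide string (plain C++, not CLucene API; the helper name is 
hypothetical):

    #include <string>
    #include <vector>

    // Hypothetical helper: split a run of CJK text into overlapping
    // character bigrams, e.g. L"日本語" -> { L"日本", L"本語" }.
    // A real analyzer would also detect script boundaries; this sketch
    // assumes the input is already a single CJK run.
    std::vector<std::wstring> cjkBigrams(const std::wstring& run) {
        std::vector<std::wstring> tokens;
        for (size_t i = 0; i + 1 < run.size(); ++i)
            tokens.push_back(run.substr(i, 2));
        if (tokens.empty() && !run.empty())
            tokens.push_back(run);  // single character: emit as-is
        return tokens;
    }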

Also, for the Greek final sigma, the character is changed to the standard 
sigma during analysis (as I mentioned in my previous email), but I don't think 
that should be a problem, since the same substitution is done during the search.
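
For reference, the two sigmas are distinct code points, so an index/query 
mismatch is only possible if one side normalizes them and the other does not 
(a minimal check in plain C++, not CLucene API):

    #include <cassert>

    int main() {
        // Lowercase sigma (σ) and final sigma (ς) are different code points.
        const wchar_t sigma = L'\u03C3';       // GREEK SMALL LETTER SIGMA
        const wchar_t finalSigma = L'\u03C2';  // GREEK SMALL LETTER FINAL SIGMA
        assert(sigma != finalSigma);
        // If an analyzer folds ς -> σ at index time, the same folding must
        // happen at query time, or terms like διαχειριστής will not match.
        return 0;
    }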

I'm afraid there is no easy way to produce exactly the same tokens as Java 
Lucene. You can of course modify the Standard Analyzer or write your own.
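
If you write your own, a minimal sketch might look like the following. I am 
assuming here that the CLucene API mirrors Java Lucene's (Analyzer, 
StandardTokenizer, LowerCaseFilter); please check the actual headers, as I am 
writing this from memory:

    #include <CLucene.h>

    using namespace lucene::analysis;

    // Sketch of a custom analyzer: StandardTokenizer followed by a
    // lowercase filter. Class and namespace names are assumptions;
    // adjust them to the real CLucene headers.
    class MyAnalyzer : public Analyzer {
    public:
        TokenStream* tokenStream(const TCHAR* fieldName,
                                 CL_NS(util)::Reader* reader) {
            TokenStream* result = _CLNEW standard::StandardTokenizer(reader);
            result = _CLNEW LowerCaseFilter(result, true);  // true: owns input
            return result;
        }
    };

From there you could add your own filter step (e.g. bigram generation or sigma 
folding) to match whatever Java Lucene produces for your data.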

Regards

Borek

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Friday, July 14, 2023 8:27 AM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Hi Developers,
I am attaching the tokens generated from Java Lucene and CLucene. I am getting 
different tokens for non-Latin text when using StandardAnalyzer.
Is there a way to make CLucene generate the same tokens as Java Lucene?

Thanks & Regards,
Achyuth Pramod

On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <kos...@tovek.cz> wrote:
CLucene supports at least Unicode plane 0.
CLucene uses wchar_t as its internal representation, while indexes use UTF-8.
You must not set ENABLE_ASCII_MODE in CMake during the build; otherwise only 
US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.
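
Since the internal representation is wchar_t while your input is UTF-8 bytes, 
the input has to be converted before it reaches the analyzer. A minimal sketch 
using only the standard library (the "en_US.UTF-8" locale name is an 
assumption about your system):

    #include <clocale>
    #include <cstdlib>
    #include <string>

    // Convert a UTF-8 byte string to the wchar_t form CLucene uses
    // internally. In real code, call setlocale() once at startup.
    std::wstring utf8ToWide(const std::string& utf8) {
        std::setlocale(LC_ALL, "en_US.UTF-8");  // assumed to exist
        std::wstring out(utf8.size(), L'\0');   // wide form is never longer
        size_t n = std::mbstowcs(&out[0], utf8.c_str(), out.size());
        if (n == static_cast<size_t>(-1))
            return std::wstring();              // invalid UTF-8 sequence
        out.resize(n);
        return out;
    }

If a query string reaches the searcher without such a conversion (or with a 
non-UTF-8 locale), non-Latin input can end up as an empty search string.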

Not 100% sure about the Standard Analyzer, because we don't use it, but I 
can't see any problem with it.

In your Greek query, the problem can also be with lowercasing and the "final 
sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma).

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Monday, July 10, 2023 2:32 PM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support


Dear developers,

I am using CLucene in my project and I would like to inquire about the UTF-8 
encoding support in the Standard Analyzer. Specifically, I would like to know 
if the Standard Analyzer handles tokenization and text processing correctly for 
non-Latin UTF-8 encoded text.

Could you please confirm if the Standard Analyzer in CLucene has built-in 
support for UTF-8 encoded text? If not, are there any recommended alternatives 
or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results for a few queries:
Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos

0. /home/nonLatin100Rows.csv - 0.04746387


Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:



Search took: 0 ms.
Screen dump took: 0 ms.
Thank you for your time.

- Achyuth Pramod
_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
