Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Kostka Bořivoj Mon, 10 Jul 2023 06:14:08 -0700

CLucene supports at least Unicode plane 0 (the Basic Multilingual Plane).
CLucene uses wchar_t as its internal representation, while the indexes use UTF-8.
You must not set ENABLE_ASCII_MODE in CMake during the build; otherwise only 
US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.
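
Since the internal representation is wchar_t, UTF-8 input (for example a query 
read from a terminal) has to be decoded to wide characters before it reaches 
the analyzer. A minimal sketch of that decoding step follows. The function name 
utf8_to_wide_bmp is hypothetical and not part of CLucene, and the sketch covers 
only 1- to 3-byte sequences (plane 0); it does not reject overlong encodings or 
surrogate code points.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a CLucene function): decode UTF-8 bytes into
// wchar_t values. Only 1- to 3-byte sequences are accepted, which covers
// Unicode plane 0; 4-byte sequences are rejected with an exception.
std::wstring utf8_to_wide_bmp(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0;
        std::size_t len = 0;
        if (lead < 0x80) {                  // 0xxxxxxx: ASCII
            cp = lead; len = 1;
        } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
            cp = lead & 0x1F; len = 2;
        } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
            cp = lead & 0x0F; len = 3;
        } else {
            throw std::runtime_error("unsupported or invalid UTF-8 lead byte");
        }
        if (i + len > in.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}
```

If the decoded query is empty or garbled before it is handed to the query 
parser, the problem is in the input conversion rather than in the analyzer.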


Not 100% sure about the Standard Analyzer, because we don't use it, but I can't 
see any problem with it.

In your Greek query, the problem may also lie in lowercasing and the "final 
sigma" (ς) character (see https://en.wikipedia.org/wiki/Sigma)
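
To illustrate the sigma issue: capital Σ (U+03A3) lowercases to σ (U+03C3), but 
a lowercase word ending in sigma is written with ς (U+03C2), so two spellings of 
the same word can produce different terms. A minimal sketch of one way to 
normalize both sides before indexing and searching follows. The function 
fold_greek_term is hypothetical and not part of CLucene; it handles only the 
unaccented capitals U+0391 to U+03A9 and the final sigma, and whether CLucene's 
own lowercasing does any of this is not confirmed here.

```cpp
#include <string>

// Hypothetical helper (not a CLucene function): lowercase unaccented Greek
// capitals and fold final sigma into regular sigma so that both spellings
// of a word yield the same term.
std::wstring fold_greek_term(std::wstring term) {
    const wchar_t kCapAlpha   = 0x0391;  // GREEK CAPITAL LETTER ALPHA
    const wchar_t kCapOmega   = 0x03A9;  // GREEK CAPITAL LETTER OMEGA
    const wchar_t kUnassigned = 0x03A2;  // gap in the capital block
    const wchar_t kFinalSigma = 0x03C2;  // GREEK SMALL LETTER FINAL SIGMA
    const wchar_t kSigma      = 0x03C3;  // GREEK SMALL LETTER SIGMA
    for (wchar_t& c : term) {
        // In this block the small letter sits 0x20 above its capital.
        if (c >= kCapAlpha && c <= kCapOmega && c != kUnassigned) {
            c = static_cast<wchar_t>(c + 0x20);
        }
        if (c == kFinalSigma) {
            c = kSigma;
        }
    }
    return term;
}
```

Applying the same folding to indexed text and to the query makes an uppercase 
word ending in Σ and its lowercase form ending in ς compare equal.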

Hope this helps

Borivoj

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Monday, July 10, 2023 2:32 PM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support


Dear developers,

I am using CLucene in my project and I would like to inquire about the UTF-8 
encoding support in the Standard Analyzer. Specifically, I would like to know 
if the Standard Analyzer handles tokenization and text processing correctly for 
non-Latin UTF-8 encoded text.

Could you please confirm if the Standard Analyzer in CLucene has built-in 
support for UTF-8 encoded text? If not, are there any recommended alternatives 
or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:
Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos

0. /home/nonLatin100Rows.csv - 0.04746387


Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:



Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod

