CLucene supports at least Unicode plane 0.
CLucene uses wchar_t as its internal representation, while indexes use UTF-8.
You must not set ENABLE_ASCII_MODE in CMake during the build; otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported.
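
Because the internal representation is wchar_t, UTF-8 input (for example a query typed into a terminal) has to be decoded into wide characters before it is handed to the library. The empty "Searching for:" line in the quoted output would be consistent with the query being lost at that step, though that is only a guess. Below is a minimal sketch of such a decoder. The name utf8ToWide is hypothetical and not part of CLucene, and the sketch assumes the text stays within plane 0 (sequences of up to three bytes):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a CLucene function): decode UTF-8 bytes into a
// wide string, one wchar_t per code point.
// Assumption: input stays within Unicode plane 0 (1- to 3-byte sequences).
// Any other byte, including a truncated sequence, becomes U+FFFD.
std::wstring utf8ToWide(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        unsigned long cp = 0xFFFD;  // replacement character by default
        std::size_t len = 1;
        if (c < 0x80) {
            cp = c;  // plain ASCII byte
        } else if ((c & 0xE0) == 0xC0 && i + 1 < in.size()) {
            // 2-byte sequence: 110xxxxx 10xxxxxx (covers Greek, Cyrillic, ...)
            const unsigned char c1 = static_cast<unsigned char>(in[i + 1]);
            cp = ((c & 0x1FUL) << 6) | (c1 & 0x3FUL);
            len = 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < in.size()) {
            // 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx
            const unsigned char c1 = static_cast<unsigned char>(in[i + 1]);
            const unsigned char c2 = static_cast<unsigned char>(in[i + 2]);
            cp = ((c & 0x0FUL) << 12) | ((c1 & 0x3FUL) << 6) | (c2 & 0x3FUL);
            len = 3;
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}
```

The sketch does not validate continuation bytes or reject overlong encodings, so it is only suitable for trusted, well-formed input.
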
I'm not 100% sure about the Standard Analyzer, because we don't use it, but I can't see any problem with it.
In your Greek query, the problem could also be with lowercasing and the „ending sigma“ (ς) character (see https://en.wikipedia.org/wiki/Sigma).

Hope this helps,
Borivoj

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Monday, July 10, 2023 2:32 PM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers,

I am using CLucene in my project and I would like to inquire about the UTF-8 encoding support in the Standard Analyzer. Specifically, I would like to know if the Standard Analyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text.

Could you please confirm if the Standard Analyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text?

Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66

Enter query string: dignissimos
Searching for: dignissimos

0. /home/nonLatin100Rows.csv - 0.04746387

Search took: 0 ms.
Screen dump took: 0 ms.

Enter query string: διαχειριστής
Searching for:

Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod
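
On the final-sigma point raised in the reply: Greek has two lowercase forms of sigma, σ (U+03C3) and word-final ς (U+03C2), while the capital Σ (U+03A3) lowercases to σ only. If that difference ever caused indexed terms and query terms to disagree, one possible workaround would be to fold all three to σ on both sides. The sketch below assumes the text is already in wide form; foldSigma is a hypothetical helper, not a CLucene function, and whether sigma handling is involved in the failing query above is not established.

```cpp
#include <string>

// Hypothetical helper (not a CLucene function): map capital sigma (U+03A3)
// and final sigma (U+03C2) to the ordinary small sigma (U+03C3), so that all
// three forms compare equal after folding.
std::wstring foldSigma(std::wstring text) {
    const wchar_t capitalSigma = static_cast<wchar_t>(0x03A3);
    const wchar_t finalSigma = static_cast<wchar_t>(0x03C2);
    const wchar_t smallSigma = static_cast<wchar_t>(0x03C3);
    for (wchar_t& ch : text) {
        if (ch == capitalSigma || ch == finalSigma) {
            ch = smallSigma;
        }
    }
    return text;
}
```

The same folding would have to be applied both when indexing and when querying; applying it on one side only would introduce mismatches rather than remove them.
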
_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers