Hello, I see differences in CJK languages (Chinese, Japanese, Korean). Note that segmentation (aka tokenization) for these languages is a very complex task, because they do not use spaces to separate words. There are techniques to work around this, e.g. creating character bigrams. And of course there are segmentation libraries based on NLP (Stanford provides one, for example). I believe the CLucene standard analyzer should generate bigrams for CJK text, but I have never tried it.
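[Editor's note] To illustrate the bigram workaround mentioned above, here is a minimal sketch, not CLucene code; the function name `cjk_bigrams` is mine. It emits overlapping character bigrams from a run of CJK text, assuming the input has already been converted to a wide string containing only CJK characters (no Latin or whitespace handling):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch: emit overlapping two-character bigrams from a CJK text run,
// the same workaround CJK-aware analyzers use in place of
// dictionary-based word segmentation. On Linux, wchar_t is 32-bit,
// so each BMP CJK character occupies exactly one wchar_t.
std::vector<std::wstring> cjk_bigrams(const std::wstring& text) {
    std::vector<std::wstring> tokens;
    for (size_t i = 0; i + 1 < text.size(); ++i) {
        tokens.push_back(text.substr(i, 2));  // each bigram overlaps its neighbor by one character
    }
    return tokens;
}
```

For example, a three-character input yields two bigrams: `cjk_bigrams(L"日本語")` produces `{L"日本", L"本語"}`. A single character yields no tokens, which is why real CJK analyzers fall back to unigrams for isolated characters.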
Also, regarding the Greek final sigma: the standard sigma character is substituted (as I mentioned in my previous email), but I don't think that should be a problem, since the same substitution is done during search. I'm afraid there is no easy way to produce exactly the same tokens as Java Lucene. You can of course modify the StandardAnalyzer or write your own.

Regards,
Borek

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Friday, July 14, 2023 8:27 AM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Hi Developers,

I am attaching the tokens generated from Java Lucene and CLucene. I am getting different tokens for non-Latin texts using StandardAnalyzer. Is there a solution that will make CLucene generate the same tokens as Java Lucene?

Thanks & Regards,
Achyuth Pramod

On Mon, Jul 10, 2023 at 6:44 PM Kostka Bořivoj <kos...@tovek.cz> wrote:

CLucene supports at least Unicode plane 0. CLucene uses wchar_t as its internal representation, while indexes use UTF-8. You must not set ENABLE_ASCII_MODE in CMake during the build, otherwise only US-ASCII (or perhaps ISO Latin 1, I'm not sure) is supported. I'm not 100% sure about the StandardAnalyzer, because we don't use it, but I can't see any problem in it. In your Greek query, the problem can also be with lowercasing and the final sigma (ς) character (see https://en.wikipedia.org/wiki/Sigma).

Hope this helps,
Borivoj

From: Achyuth Pramod [mailto:achyuthpra...@gmail.com]
Sent: Monday, July 10, 2023 2:32 PM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Inquiry about CLucene's UTF-8 support

Dear developers,

I am using CLucene in my project and I would like to inquire about UTF-8 encoding support in the StandardAnalyzer. Specifically, I would like to know whether the StandardAnalyzer handles tokenization and text processing correctly for non-Latin UTF-8 encoded text.
Could you please confirm whether the StandardAnalyzer in CLucene has built-in support for UTF-8 encoded text? If not, are there any recommended alternatives or additional analyzers that provide better support for non-Latin UTF-8 text? Below are the search results of a few queries:

Max Docs: 1
Num Docs: 1
Current Version: 1688707923968.0
Term count: 66
Enter query string: dignissimos
Searching for: dignissimos
0. /home/nonLatin100Rows.csv - 0.04746387
Search took: 0 ms.
Screen dump took: 0 ms.
Enter query string: διαχειριστής
Searching for:
Search took: 0 ms.
Screen dump took: 0 ms.

Thank you for your time.

- Achyuth Pramod

_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
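[Editor's note] On the empty "Searching for:" line above: one possible cause, an assumption rather than a confirmed diagnosis, is that the UTF-8 bytes read from the terminal were never converted to the wchar_t representation CLucene uses internally, so the non-Latin query arrived empty or mangled. A minimal sketch of such a conversion follows; `utf8_to_wide` is a hypothetical helper, not a CLucene API, it covers plane-0 code points only, and it does no validation of continuation bytes:

```cpp
#include <cassert>
#include <string>

// Sketch: minimal UTF-8 -> wchar_t decoder covering code points up to
// U+FFFF (Unicode plane 0). A real converter must validate continuation
// bytes and handle 4-byte sequences; this one assumes well-formed input.
std::wstring utf8_to_wide(const std::string& in) {
    std::wstring out;
    for (size_t i = 0; i < in.size(); ) {
        unsigned char b = in[i];
        if (b < 0x80) {                       // 1-byte (ASCII)
            out += static_cast<wchar_t>(b);
            i += 1;
        } else if ((b & 0xE0) == 0xC0) {      // 2-byte sequence (e.g. Greek)
            out += static_cast<wchar_t>(((b & 0x1F) << 6) |
                                        (in[i + 1] & 0x3F));
            i += 2;
        } else {                              // 3-byte sequence (rest of BMP, e.g. CJK)
            out += static_cast<wchar_t>(((b & 0x0F) << 12) |
                                        ((in[i + 1] & 0x3F) << 6) |
                                        (in[i + 2] & 0x3F));
            i += 3;
        }
    }
    return out;
}
```

With a conversion like this applied to the raw query bytes before they reach the query parser, a Greek query such as διαχειριστής should at least survive as a wide string; whether the resulting tokens then match the index is a separate analyzer question.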
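[Editor's note] As a footnote to the final-sigma discussion earlier in the thread: folding ς (U+03C2) to the medial form σ (U+03C3) after lowercasing is one common way to make index-time and query-time tokens agree regardless of where the letter falls in a word. A sketch follows; `fold_final_sigma` is an illustrative name, not a CLucene function:

```cpp
#include <cassert>
#include <string>

// Sketch: fold the Greek final sigma (ς, U+03C2) to the medial form
// (σ, U+03C3). Applied identically at index time and query time, this
// keeps word-final and word-internal sigmas from producing different
// tokens for the same word stem.
std::wstring fold_final_sigma(std::wstring s) {
    for (wchar_t& c : s) {
        if (c == 0x03C2) c = 0x03C3;
    }
    return s;
}
```

For example, the query term from the thread, διαχειριστής, folds to διαχειριστήσ; as Borek notes above, such a substitution is harmless as long as the exact same mapping runs on both the indexed text and the query.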