Hi,

I’m quite sure the standard tokenizer doesn’t support Unicode combining characters. The question is how to process them. I think for Russian the best way is simply to skip this character (create the token text without it), because it is only used to show where the stress falls in the word. Accent marks are hardly ever used in Russian texts, and a word should be treated as the same word with or without them.
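A minimal sketch of that idea, under stated assumptions: the text is available as a std::wstring before it reaches the analyzer, and only U+0301 (Combining Acute Accent) needs to go. Both is_stress_mark and strip_stress_marks are hypothetical helper names, not CLucene functions, and the literals use \uNNNN escapes to avoid non-ASCII characters in source.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical helper, not part of CLucene: true only for U+0301
// (Combining Acute Accent), the mark used to show stress in Russian.
inline bool is_stress_mark(wchar_t ch) {
    return ch == 0x0301;
}

// Hypothetical helper: returns a copy of `text` with every U+0301 removed.
// Other combining marks are kept on purpose, e.g. U+0306 in a decomposed
// "й", because dropping those would change the letter itself.
inline std::wstring strip_stress_marks(const std::wstring &text) {
    std::wstring result;
    result.reserve(text.size());
    std::remove_copy_if(text.begin(), text.end(), std::back_inserter(result),
                        is_stress_mark);
    return result;
}
```

Such a helper would presumably have to be applied to both the indexed text and the query string, so that the two stay consistent.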
Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 5:47 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me

Hi,

thanks a lot for the hints. Changing the locale did not work, but now I have a better understanding, and I was able to hack together a "fix" for the StandardTokenizer.

In Федера́ция, the а́ character is actually split into а and ́, where the latter (0x0301, Combining Acute Accent) is not considered alphanumeric by the _istalnum(ch) function:

#define ALNUM (_istalnum(ch) != 0)

Thanks for the help, have a nice day!

On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <kos...@tovek.cz> wrote:

Hi,

The problem is probably in StandardTokenizer. Unfortunately I’m not familiar with it, as we are using our own tokenizer, so I’m just guessing.

1) It uses _istspace, which is mapped to iswspace. Some time ago I discovered that these functions use the standard “C” locale by default (and don’t work well with non-English characters). We solved this problem by calling setlocale( LC_CTYPE, "" ) during program startup. No idea if this helps, but it is easy to try.

2) I have had really bad experiences with non-ASCII characters inside source code, especially in the multiplatform environment we use (Windows + Linux). It should work OK if the file is in UTF-8, but we still had issues with files saved with and without a BOM. We encode characters as \uNNNN if we need them in source (there are free online converters, such as https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php).

Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 3:18 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for me

Hi,

I checked my index with Luke. These are the tokens in my index:

1 content официально
1 content росси
1 content также
1 content федера
1 content ция
1 content я
1 content йская

Interestingly, the word Федера́ция is split into федера and ция.

Yes, TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same on Mac, Linux and Windows for me).

Thanks for this Luke tool, it's awesome.

On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <kos...@tovek.cz> wrote:

Hi again,

It would be interesting to explore the index content. It seems to me that the word “Федера́ция” is treated as two words, Федер and ция (in other words, а́ is treated as a space). You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore the index content.

Regards
Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 11:41 AM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me

Hi all,

I'm trying to index some Russian content and search in this content using the CLucene library (v2.3.3.4-10). It works most of the time, but for some words the wildcard query is not working for me, and I have no idea why. Can anybody help me with this, please?
Here is my source code:

main.cc:

#include <QCoreApplication>
#include <QString>
#include <QDebug>
#include <QScopedPointer>
#include <CLucene.h>

const TCHAR FIELD_CONTENT[] = L"content";
const char INDEX_PATH[] = "/tmp/index";

void create_index(const QString &content)
{
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);
    lucene::document::Document doc;
    std::wstring content_buffer = content.toStdWString();
    doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT, content_buffer.data(),
                                            lucene::document::Field::STORE_NO |
                                            lucene::document::Field::INDEX_TOKENIZED |
                                            lucene::document::Field::TERMVECTOR_NO,
                                            true));
    writer.addDocument(&doc);
    writer.flush();
    writer.close(true);
}

void search(const QString &query_string)
{
    lucene::search::IndexSearcher searcher(INDEX_PATH);
    lucene::analysis::standard::StandardAnalyzer analyzer;
    lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
    parser.setAllowLeadingWildcard(true);
    std::wstring query = query_string.toStdWString();
    QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
    QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data()));
    TCHAR *query_debug_string(lucene_query->toString());
    qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
    free(query_debug_string);
}

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    create_index(QString("Росси́я официально также Росси́йская Федера́ция"));
    search(QString("noWordLkeThis")); // ok
    search(QString("Федера́ция")); // ok
    search(QString("Федер*ция")); // ERROR: it should work, but it doesn't
    search(QString("Фед*")); // ok
    search(QString("Федер")); // ok
    search(QString("\"федера ция\"")); // why is this working?
    search(QString("официально")); // ok
    search(QString("офиц*ьно")); // ok
    search(QString("оф*циально")); // ok
    search(QString("офици*но")); // ok
    return 0;
}

cluceneutf8.pro:

QT -= gui
CONFIG += c++11 console
CONFIG -= app_bundle
CONFIG += link_pkgconfig
PKGCONFIG += libclucene-core
SOURCES += \
    main.cc

Build and run:

qmake && make && ./cluceneutf8

The output of the program:

found? "noWordLkeThis" "content:nowordlkethis" false
found? "Федера́ция" "content:\"федера ция\"" true
found? "Федер*ция" "content:федер*ция" false
found? "Фед*" "content:фед*" true
found? "Федер" "content:федер" false
found? "\"федера ция\"" "content:\"федера ция\"" true
found? "официально" "content:официально" true
found? "офиц*ьно" "content:офиц*ьно" true
found? "оф*циально" "content:оф*циально" true
found? "офици*но" "content:офици*но" true

It's built with Qt and qmake, but I also made a non-Qt version; I can share that if it would be better.

So my problem is that I can search for Федера́ция but I can't search for Федер*ция, for example. Other words like официально can be found either way.

Thanks.

--
Dömők Tamás
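The split seen in the index (федера + ция) can be reproduced without CLucene. The following is only a sketch, not CLucene's StandardTokenizer: is_word_char is a hypothetical, simplified stand-in for _istalnum (ASCII letters and digits plus the basic Cyrillic block only), and tokenize is a hypothetical helper that ends the current token at every character the predicate rejects. With U+0301 rejected, Федера́ция yields two tokens, so no single indexed term can match Федер*ция; with U+0301 accepted as a word character, the word stays in one token.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for _istalnum: ASCII letters and digits
// plus the basic Cyrillic block (U+0400..U+04FF). Not CLucene's real tables.
inline bool is_word_char(wchar_t ch) {
    return (ch >= L'0' && ch <= L'9') || (ch >= L'A' && ch <= L'Z') ||
           (ch >= L'a' && ch <= L'z') || (ch >= 0x0400 && ch <= 0x04FF);
}

// Same predicate, but U+0301 (Combining Acute Accent) also counts as part
// of a word, so a stress mark no longer ends the current token.
inline bool is_word_char_with_accent(wchar_t ch) {
    return is_word_char(ch) || ch == 0x0301;
}

// Hypothetical helper: splits `text` into maximal runs of characters
// accepted by `accept`; every rejected character acts as a separator.
inline std::vector<std::wstring> tokenize(
        const std::wstring &text, const std::function<bool(wchar_t)> &accept) {
    std::vector<std::wstring> tokens;
    std::wstring current;
    for (wchar_t ch : text) {
        if (accept(ch)) {
            current.push_back(ch);
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Note that if the accent is kept inside the token like this, the indexed term still contains U+0301, so a query typed without the accent would not match unless the mark is also removed at some point.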
_______________________________________________ CLucene-developers mailing list CLucene-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/clucene-developers