Hi,

I’m quite sure the standard tokenizer doesn’t support Unicode combining characters.
The question is how to process them.
I think for Russian the best approach is simply to skip this character
(create the token text without it), because it is only used to show
where the accent falls in the word.
Accent signs are hardly ever used in Russian texts, and a word should be
treated the same with or without them.
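
Roughly like this (just a sketch of the idea, not tied to any particular tokenizer; the helper name is made up):

#include <string>

// Strip the combining acute accent (U+0301) so that "Федера́ция" and
// "Федерация" end up as the same token text.
std::wstring strip_acute_accent(const std::wstring &input)
{
  std::wstring output;
  output.reserve(input.size());
  for (wchar_t ch : input) {
    if (ch != 0x0301)   // skip the combining mark, keep everything else
      output.push_back(ch);
  }
  return output;
}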

Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 5:47 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for 
me

Hi,

Thanks a lot for the hints. Changing the locale did not work, but now I have a
better understanding, and I was able to put together a hack to "fix" the
StandardTokenizer.

Федера́ция

Here the а́ character is actually split into а and  ́ , where the latter
(U+0301 Combining Acute Accent) is not considered alphanumeric by the
_istalnum(ch) function.

#define ALNUM         (_istalnum(ch) != 0)
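
For example, a change along these lines would keep the accent inside the token (sketch only, untested against the real StandardTokenizer source): also treat the Unicode combining diacritical marks block (U+0300–U+036F) as token characters.

// Sketch: accept combining diacritical marks as part of a token,
// so the accent no longer splits the word in two.
#define COMBINING_MARK (ch >= 0x0300 && ch <= 0x036F)
#define ALNUM          (_istalnum(ch) != 0 || COMBINING_MARK)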

thanks for the help, have a nice day!

On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <kos...@tovek.cz> wrote:
Hi,

The problem is probably in the StandardTokenizer. Unfortunately I’m not familiar with
it, as we use our own tokenizer, so I’m just guessing.

1)      It uses _istspace, which is mapped to iswspace. Some time ago I
discovered that these functions use the standard “C” locale by default (and don’t
work well with non-English characters).

We solved this problem by calling setlocale( LC_CTYPE, "" ) during program
startup. No idea if this helps here, but it is easy to try.
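
For illustration, a minimal standalone sketch (whether iswalnum/iswspace actually start classifying Cyrillic correctly depends on the platform and on the environment locale):

#include <clocale>
#include <cwctype>
#include <cstdio>

int main()
{
  // Switch from the default "C" locale to the one taken from the environment,
  // so the wide-character classification functions can handle non-ASCII input.
  setlocale(LC_CTYPE, "");

  // With a UTF-8 or Russian locale this should now print 1 for Cyrillic 'ya'.
  printf("iswalnum(U+044F) = %d\n", iswalnum(L'\u044F') != 0);
  return 0;
}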

2)      I have really bad experience with non-ASCII characters inside source
code, especially in the multiplatform environment we use (Windows + Linux). It
should work OK if the file is in UTF-8, but we still had BOM/no-BOM issues. We
encode characters as \uNNNN when we need them in source (there are free online
converters, like
https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php).
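
For example, your word Федера́ция with the accent written via escapes would look like this (escape values taken from the code points, worth double-checking):

// "Федера́ция" (including the U+0301 combining acute accent) written with
// universal character names, so the source file itself stays pure ASCII:
const wchar_t WORD[] = L"\u0424\u0435\u0434\u0435\u0440\u0430\u0301\u0446\u0438\u044F";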

Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 3:18 PM
To: clucene-developers@lists.sourceforge.net
Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not working for 
me

Hi,

I checked my index with Luke. These are the tokens in my index:

1 content официально
1 content росси
1 content также
1 content федера
1 content ция
1 content я
1 content йская


It's interesting that the word  Федера́ция  is split into федера and ция.

Yes, TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the same
on macOS, Linux and Windows for me).

Thanks for this Luke tool, it's awesome.


On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <kos...@tovek.cz> wrote:
Hi again

It would be interesting to explore the index content. It seems to me that the word
“Федера́ция” is treated as two words, Федер and ция (in other words, а́ is treated
as a space).
You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore
the index content.

Regards

Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 11:41 AM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me

Hi all,

I'm trying to index some Russian content and search in it using the
CLucene library (v2.3.3.4-10). It works most of the time, but for some words a
wildcard query does not work, and I have no idea why.

Can anybody help me on this, please?

Here is my source code:

main.cc:


#include <QCoreApplication>

#include <QString>
#include <QDebug>
#include <QScopedPointer>

#include <CLucene.h>

const TCHAR FIELD_CONTENT[] = L"content";
const char INDEX_PATH[] = "/tmp/index";

void create_index(const QString &content)
{
  lucene::analysis::standard::StandardAnalyzer analyzer;
  lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);

  lucene::document::Document doc;
  std::wstring content_buffer = content.toStdWString();
  doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT,
                                          content_buffer.data(),
                                          lucene::document::Field::STORE_NO |
                                          lucene::document::Field::INDEX_TOKENIZED |
                                          lucene::document::Field::TERMVECTOR_NO,
                                          true));
  writer.addDocument(&doc);

  writer.flush();
  writer.close(true);
}

void search(const QString &query_string)
{
  lucene::search::IndexSearcher searcher(INDEX_PATH);

  lucene::analysis::standard::StandardAnalyzer analyzer;
  lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
  parser.setAllowLeadingWildcard(true);

  std::wstring query = query_string.toStdWString();
  QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
  QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data()));

  TCHAR *query_debug_string(lucene_query->toString());
  qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
  free(query_debug_string);
}

int main(int argc, char *argv[])
{
  QCoreApplication a(argc, argv);

  create_index(QString("Росси́я официально также Росси́йская Федера́ция"));

  search(QString("noWordLkeThis"));   // ok

  search(QString("Федера́ция"));       // ok
  search(QString("Федер*ция"));       // ERROR: it should work, but it doesn't
  search(QString("Фед*"));            // ok
  search(QString("Федер"));           // ok
  search(QString("\"федера ция\""));  // why is this working?

  search(QString("официально"));      // ok
  search(QString("офиц*ьно"));        // ok
  search(QString("оф*циально"));      // ok
  search(QString("офици*но"));        // ok

  return 0;
}

cluceneutf8.pro:

QT -= gui

CONFIG += c++11 console
CONFIG -= app_bundle

CONFIG += link_pkgconfig
PKGCONFIG += libclucene-core

SOURCES += \
        main.cc

qmake && make && ./cluceneutf8

The output of the program:

found? "noWordLkeThis" "content:nowordlkethis" false
found? "Федера́ция" "content:\"федера ция\"" true
found? "Федер*ция" "content:федер*ция" false
found? "Фед*" "content:фед*" true
found? "Федер" "content:федер" false
found? "\"федера ция\"" "content:\"федера ция\"" true
found? "официально" "content:официально" true
found? "офиц*ьно" "content:офиц*ьно" true
found? "оф*циально" "content:оф*циально" true
found? "офици*но" "content:офици*но" true


It's built with Qt and qmake, but I also made a non-Qt version; if that would be
better to share, I can.

So my problem is that I can search for Федера́ция, but I can't search for
Федер*ция, for example. Other words like официально can be searched in all of these ways.


Thanks.

--
Dömők Tamás
_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers