Hi again

It would be interesting to explore the index content. It seems to me that the
word “Федера́ция” is treated as two words, Федера and ция (in other words, the
stress mark in а́ is treated as a space). That would match the phrase query
"федера ция" that the parser produces for this word.
You can use Luke (https://code.google.com/archive/p/luke/downloads) to explore
the index content.
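
If the stress mark really is what splits the word, one possible workaround is to strip combining marks from the text before indexing it and from the query string before parsing it. Below is a minimal sketch using only the standard library. The helper name strip_combining_marks is hypothetical (it is not part of CLucene), and it assumes the accent is stored as a separate code point from the Combining Diacritical Marks block (U+0300 to U+036F), such as U+0301.

```cpp
#include <string>

// Hypothetical helper, not part of CLucene: drops every code point in the
// Combining Diacritical Marks block (U+0300..U+036F), for example the
// stress mark U+0301 that follows the letter "а" in "Федера́ция".
// Assumption: accents are stored as separate combining code points.
std::wstring strip_combining_marks(const std::wstring &text)
{
  std::wstring result;
  result.reserve(text.size());
  for (wchar_t ch : text) {
    if (ch < 0x0300 || ch > 0x036F)
      result.push_back(ch);  // keep everything that is not a combining mark
  }
  return result;
}
```

In the program quoted below, this could be applied to the result of toStdWString() in both create_index() and search(), so that the indexed text and the query are normalized the same way.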

Regards

Borek

From: Tamás Dömők [mailto:domokta...@gmail.com]
Sent: Wednesday, July 24, 2019 11:41 AM
To: clucene-developers@lists.sourceforge.net
Subject: [CLucene-dev] Wildcard query on a Russian text is not working for me

Hi all,

I'm trying to index some Russian content and search in this content using the 
CLucene library (v2.3.3.4-10). It works most of the time, but on some words the 
wildcard query is not working for me, and I have no idea why.

Can anybody help me with this, please?

Here is my source code:

main.cc:


#include <QCoreApplication>

#include <QString>
#include <QDebug>
#include <QScopedPointer>

#include <CLucene.h>

const TCHAR FIELD_CONTENT[] = L"content";
const char INDEX_PATH[] = "/tmp/index";

void create_index(const QString &content)
{
  lucene::analysis::standard::StandardAnalyzer analyzer;
  lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);

  lucene::document::Document doc;
  std::wstring content_buffer = content.toStdWString();
  doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT,
                                          content_buffer.data(),
                                          lucene::document::Field::STORE_NO |
                                          lucene::document::Field::INDEX_TOKENIZED |
                                          lucene::document::Field::TERMVECTOR_NO,
                                          true));
  writer.addDocument(&doc);

  writer.flush();
  writer.close(true);
}

void search(const QString &query_string)
{
  lucene::search::IndexSearcher searcher(INDEX_PATH);

  lucene::analysis::standard::StandardAnalyzer analyzer;
  lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
  parser.setAllowLeadingWildcard(true);

  std::wstring query = query_string.toStdWString();
  QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
  QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data()));

  TCHAR *query_debug_string(lucene_query->toString());
  qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
  free(query_debug_string);
}

int main(int argc, char *argv[])
{
  QCoreApplication a(argc, argv);

  create_index(QString("Росси́я официально также Росси́йская Федера́ция"));

  search(QString("noWordLkeThis"));   // ok

  search(QString("Федера́ция"));       // ok
  search(QString("Федер*ция"));       // ERROR: it should work, but it doesn't
  search(QString("Фед*"));            // ok
  search(QString("Федер"));           // ok
  search(QString("\"федера ция\""));  // why is this working?

  search(QString("официально"));      // ok
  search(QString("офиц*ьно"));        // ok
  search(QString("оф*циально"));      // ok
  search(QString("офици*но"));        // ok

  return 0;
}

cluceneutf8.pro:


QT -= gui

CONFIG += c++11 console
CONFIG -= app_bundle

CONFIG += link_pkgconfig
PKGCONFIG += libclucene-core

SOURCES += \
        main.cc


qmake && make && ./cluceneutf8

The output of the program:

found? "noWordLkeThis" "content:nowordlkethis" false
found? "Федера́ция" "content:\"федера ция\"" true
found? "Федер*ция" "content:федер*ция" false
found? "Фед*" "content:фед*" true
found? "Федер" "content:федер" false
found? "\"федера ция\"" "content:\"федера ция\"" true
found? "официально" "content:официально" true
found? "офиц*ьно" "content:офиц*ьно" true
found? "оф*циально" "content:оф*циально" true
found? "офици*но" "content:офици*но" true


It's built with Qt and qmake, but I also made a non-Qt version. I can share
that one instead if it would be more useful.

So my problem is that I can search for Федера́ция, but I can't search for
Федер*ция, for example. Other words, like официально, can be found with
wildcard queries without any problem.
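
One way to read the output above: a wildcard pattern is matched against individual indexed terms, so if the word was indexed as the two terms федера and ция, no single term can match федер*ция. The following is a minimal standard-library sketch of that reasoning. The term list is an assumption inferred from the phrase query "федера ция" in the output, and wildcard_match and any_term_matches are hypothetical helpers, not CLucene's implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns true if the whole `term` matches `pattern`,
// where '*' matches any run of characters (including none) and '?' matches
// exactly one character.
bool wildcard_match(const std::wstring &pattern, const std::wstring &term)
{
  std::size_t p = 0, t = 0;
  std::size_t star = std::wstring::npos;  // position of the last '*' seen
  std::size_t mark = 0;                   // term position where that '*' started
  while (t < term.size()) {
    if (p < pattern.size() && pattern[p] == L'*') {
      star = p++;
      mark = t;
    } else if (p < pattern.size() &&
               (pattern[p] == L'?' || pattern[p] == term[t])) {
      ++p;
      ++t;
    } else if (star != std::wstring::npos) {
      // Backtrack: let the last '*' absorb one more character.
      p = star + 1;
      t = ++mark;
    } else {
      return false;
    }
  }
  while (p < pattern.size() && pattern[p] == L'*')
    ++p;  // trailing '*' may match the empty string
  return p == pattern.size();
}

// A wildcard query hits a document only if at least one single term matches.
bool any_term_matches(const std::wstring &pattern,
                      const std::vector<std::wstring> &terms)
{
  for (const std::wstring &term : terms)
    if (wildcard_match(pattern, term))
      return true;
  return false;
}
```

With the assumed terms федера and ция, the patterns федер*ция and федер match nothing while фед* matches федера, which agrees with the output above. With a single term федерация, the pattern федер*ция would match.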


Thanks.

--
Dömők Tamás
