Re: [CLucene-dev] Wildcard query on a Russian text is not working for me

Tamás Dömők Thu, 25 Jul 2019 04:18:45 -0700

Hi,

Yes, I ended up removing the accents before passing the text to CLucene:
I normalize it to NFD
(https://unicode.org/reports/tr15/#Normalization_Forms_Table) and drop the
combining marks.


    QString unaccent(const QString &s)
    {
      const QString normalized = s.normalized(QString::NormalizationForm_D);
      QString out;
      out.reserve(normalized.size());
      for (const QChar &c : normalized)
        {
          if (c.category() != QChar::Mark_NonSpacing &&
              c.category() != QChar::Mark_SpacingCombining &&
              c.category() != QChar::Mark_Enclosing)
            {
              out.append(c);
            }
        }
      out.squeeze();
      return out;
    }


I also tested it with other accented languages (Hungarian, for example) and
it seems to be working. :)
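
For anyone who is not using Qt, here is a rough sketch of the same idea on a
plain std::u32string. The names is_combining_mark and strip_combining_marks
are made up for this example, and the hard-coded ranges are only an
approximation of the category check above (an assumption, not a full Unicode
property lookup).

```cpp
#include <string>

// Hypothetical helper: true for code points in the main Unicode
// combining-mark blocks. Approximates the Qt category check; it is
// not a complete Unicode property lookup.
bool is_combining_mark(char32_t c)
{
  return (c >= 0x0300 && c <= 0x036F) ||  // Combining Diacritical Marks
         (c >= 0x1AB0 && c <= 0x1AFF) ||  // ... Extended
         (c >= 0x1DC0 && c <= 0x1DFF) ||  // ... Supplement
         (c >= 0x20D0 && c <= 0x20FF) ||  // ... for Symbols
         (c >= 0xFE20 && c <= 0xFE2F);    // Combining Half Marks
}

// Copy the input, leaving out every combining mark.
std::u32string strip_combining_marks(const std::u32string &s)
{
  std::u32string out;
  out.reserve(s.size());
  for (char32_t c : s)
    if (!is_combining_mark(c))
      out.push_back(c);
  return out;
}
```

Unlike the Qt version, this sketch does not decompose anything first, so a
precomposed character such as U+00E9 passes through unchanged. It only helps
when the accent is already a separate code point, as with U+0301 here.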


On Thu, 25 Jul 2019 at 11:48, Kostka Bořivoj <kos...@tovek.cz> wrote:

> Hi,
>
>
>
> I’m quite sure the standard tokenizer doesn’t support Unicode combining
> characters.
>
> The question is how to process them.
>
> I think for Russian the best way is simply to skip this character
> (create the token text without it), because it is only used to show
> where the stress falls in the word.
>
> Accent marks are hardly ever used in Russian texts, and a word should be
> treated as the same word with or without them.
>
>
>
> Borek
>
>
>
> *From:* Tamás Dömők [mailto:domokta...@gmail.com]
> *Sent:* Wednesday, July 24, 2019 5:47 PM
> *To:* clucene-developers@lists.sourceforge.net
> *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not
> working for me
>
>
>
> Hi,
>
>
>
> thanks a lot for the hints. Changing the locale did not work, but now I
> have a better understanding, and I could make some hack for "fixing" the
> StandardTokenizer.
>
>
> Федера́ция
>
>
> Here the *а́* character is actually split into *а* and U+0301 (Combining
> Acute Accent), and the latter is not considered alphanumeric by the
> _istalnum(ch) function:
>
> #define ALNUM         (_istalnum(ch) != 0)
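
To make the effect concrete, here is a minimal sketch of a tokenizer that
ends a token at every character failing an "is alphanumeric" test, which is
the behaviour described above. It is not CLucene's code: is_word_char and
tokenize are made-up names, and the predicate is a hard-coded stand-in for
_istalnum that deliberately rejects U+0301.

```cpp
#include <string>
#include <vector>

// Stand-in for _istalnum(ch): ASCII letters and digits plus the
// Cyrillic block. U+0301 is not accepted, mirroring the behaviour
// described above. (Assumption for illustration, not CLucene code.)
bool is_word_char(char32_t c)
{
  return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
         (c >= U'a' && c <= U'z') || (c >= 0x0400 && c <= 0x04FF);
}

// Split text into tokens at every character that is not a word character.
std::vector<std::u32string> tokenize(const std::u32string &text)
{
  std::vector<std::u32string> tokens;
  std::u32string current;
  for (char32_t c : text) {
    if (is_word_char(c)) {
      current.push_back(c);
    } else if (!current.empty()) {
      tokens.push_back(current);
      current.clear();
    }
  }
  if (!current.empty())
    tokens.push_back(current);
  return tokens;
}
```

Fed the word with the accent in it, this returns two tokens, one ending just
before U+0301 and one starting right after it, the same split that shows up
in the index.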
>
>
>
> thanks for the help, have a nice day!
>
>
>
> On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <kos...@tovek.cz> wrote:
>
> Hi,
>
>
>
> The problem should be in StandardTokenizer. Unfortunately I’m not familiar
> with it, as we are using our own tokenizer.
>
> So I’m just guessing.
>
> 1)      It uses _istspace, which is mapped to iswspace. Some time ago I
> discovered that these functions use the standard “C” locale by default
> (and don’t work well with non-English characters).
>
> We solved this problem by calling setlocale( LC_CTYPE, "" ) during program
> startup. No idea if this helps, but it is easy to try.
>
> 2)      I have had really bad experiences with non-ASCII characters inside
> source code, especially in the multiplatform environment we use (Windows +
> Linux). It should work OK if the file is in UTF-8, but we still had issues
> with files with and without a BOM. We encode characters as \uNNNN if we
> need them in source. There are free online converters, like
> https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php
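
Both suggestions fit in a few lines. This is only a sketch: kWord and
use_environment_ctype are made-up names, and whether the locale switch
changes how a given character is classified depends on the platform and on
the locales installed.

```cpp
#include <clocale>
#include <cwchar>

// The accented word from this thread, spelled with \uNNNN escapes so the
// source file stays pure ASCII. The character at index 6 is U+0301.
const wchar_t kWord[] =
    L"\u0424\u0435\u0434\u0435\u0440\u0430\u0301\u0446\u0438\u044F";

// Switch the character-classification functions (iswspace, iswalnum, ...)
// from the default "C" locale to the locale named by the environment.
// Returns false if the environment names a locale that is not available.
bool use_environment_ctype()
{
  return std::setlocale(LC_CTYPE, "") != nullptr;
}
```

Calling use_environment_ctype() once at program startup, before any indexing
or searching, corresponds to the setlocale call in point 1.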
>
>
>
> Borek
>
>
>
> *From:* Tamás Dömők [mailto:domokta...@gmail.com]
> *Sent:* Wednesday, July 24, 2019 3:18 PM
> *To:* clucene-developers@lists.sourceforge.net
> *Subject:* Re: [CLucene-dev] Wildcard query on a Russian text is not
> working for me
>
>
>
> Hi,
>
>
>
> I checked my index with Luke. These are the tokens in my index:
>
>
>
> 1 content официально
> 1 content росси
> 1 content также
> 1 content федера
> 1 content ция
> 1 content я
> 1 content йская
>
>
>
>
>
> Interestingly, the word *Федера́ция* is split into *федера* and *ция*.
>
>
>
> Yes, TCHAR is defined as wchar_t on my platforms (it behaves exactly the
> same on macOS, Linux and Windows for me).
>
>
>
> Thanks for this Luke tool, it's awesome.
>
>
>
>
>
> On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <kos...@tovek.cz> wrote:
>
> Hi again
>
>
>
> It would be interesting to explore the index content. It seems to me that
> the word “Федера́ция” is treated as two words, Федер and ция (in other
> words, а́ is treated as a space).
>
> You can use Luke (https://code.google.com/archive/p/luke/downloads) to
> explore the index content.
>
>
>
> Regards
>
>
>
> Borek
>
>
>
> *From:* Tamás Dömők [mailto:domokta...@gmail.com]
> *Sent:* Wednesday, July 24, 2019 11:41 AM
> *To:* clucene-developers@lists.sourceforge.net
> *Subject:* [CLucene-dev] Wildcard query on a Russian text is not working
> for me
>
>
>
> Hi all,
>
>
>
> I'm trying to index some Russian content and search in this content using
> the CLucene library (v2.3.3.4-10). It works most of the time, but on some
> words the wildcard query is not working for me, and I have no idea why.
>
>
>
> Can anybody help me on this, please?
>
>
>
> Here is my source code:
>
>
>
> *main.cc:*
>
>
>
> #include <QCoreApplication>
>
> #include <QString>
> #include <QDebug>
> #include <QScopedPointer>
>
> #include <CLucene.h>
>
> const TCHAR FIELD_CONTENT[] = L"content";
> const char INDEX_PATH[] = "/tmp/index";
>
> void create_index(const QString &content)
> {
>   lucene::analysis::standard::StandardAnalyzer analyzer;
>   lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);
>
>   lucene::document::Document doc;
>   std::wstring content_buffer = content.toStdWString();
>   doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT,
>                                           content_buffer.data(),
>                                           lucene::document::Field::STORE_NO |
>                                           lucene::document::Field::INDEX_TOKENIZED |
>                                           lucene::document::Field::TERMVECTOR_NO,
>                                           true));
>   writer.addDocument(&doc);
>
>   writer.flush();
>   writer.close(true);
> }
>
> void search(const QString &query_string)
> {
>   lucene::search::IndexSearcher searcher(INDEX_PATH);
>
>   lucene::analysis::standard::StandardAnalyzer analyzer;
>   lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
>   parser.setAllowLeadingWildcard(true);
>
>   std::wstring query = query_string.toStdWString();
>   QScopedPointer< lucene::search::Query >
>       lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
>   QScopedPointer< lucene::search::Hits >
>       hits(searcher.search(lucene_query.data()));
>
>   TCHAR *query_debug_string(lucene_query->toString());
>   qDebug() << "found?" << query_string
>            << QString::fromWCharArray(query_debug_string)
>            << (hits->length() > 0);
>   free(query_debug_string);
> }
>
> int main(int argc, char *argv[])
> {
>   QCoreApplication a(argc, argv);
>
>   create_index(QString("Росси́я официально также Росси́йская Федера́ция"));
>
>   search(QString("noWordLkeThis"));   // ok
>
>   search(QString("Федера́ция"));       // ok
>   search(QString("Федер*ция"));       // ERROR: it should work, but it doesn't
>   search(QString("Фед*"));            // ok
>   search(QString("Федер"));           // ok
>   search(QString("\"федера ция\""));  // why is this working?
>
>   search(QString("официально"));      // ok
>   search(QString("офиц*ьно"));        // ok
>   search(QString("оф*циально"));      // ok
>   search(QString("офици*но"));        // ok
>
>   return 0;
> }
>
>
>
> *cluceneutf8.pro:*
>
> QT -= gui
>
> CONFIG += c++11 console
> CONFIG -= app_bundle
>
> CONFIG += link_pkgconfig
> PKGCONFIG += libclucene-core
>
> SOURCES += \
>         main.cc
>
>
>
>
>
> qmake && make && ./cluceneutf8
>
>
>
> *The output of the program:*
>
>
>
> found? "noWordLkeThis" "content:nowordlkethis" false
> found? "Федера́ция" "content:\"федера ция\"" true
> found? "Федер*ция" "content:федер*ция" false
> found? "Фед*" "content:фед*" true
> found? "Федер" "content:федер" false
> found? "\"федера ция\"" "content:\"федера ция\"" true
> found? "официально" "content:официально" true
> found? "офиц*ьно" "content:офиц*ьно" true
> found? "оф*циально" "content:оф*циально" true
> found? "офици*но" "content:офици*но" true
>
>
>
>
>
> It's built with Qt and qmake, but I also made a non-Qt version, which I can
> share if that would be better.
>
>
>
> So my problem is that I can search for "Федера́ция" but I can't search
> for "Федер*ция", for example. Other words, like "официально", can be
> searched either way, with or without wildcards.
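
A wildcard query is matched against the individual terms stored in the
index, one term at a time, so the output above is what you would expect once
the word has been indexed as two terms. Here is a small sketch of that
check. wildcard_match and any_term_matches are made-up helpers for
illustration, not CLucene's implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal wildcard matcher: '*' matches any run of characters,
// '?' matches exactly one. Hypothetical helper, not CLucene's code.
bool wildcard_match(const std::u32string &pattern, const std::u32string &term,
                    std::size_t p = 0, std::size_t t = 0)
{
  if (p == pattern.size())
    return t == term.size();
  if (pattern[p] == U'*') {
    // Try every possible length for the run matched by '*'.
    for (std::size_t skip = t; skip <= term.size(); ++skip)
      if (wildcard_match(pattern, term, p + 1, skip))
        return true;
    return false;
  }
  if (t < term.size() && (pattern[p] == U'?' || pattern[p] == term[t]))
    return wildcard_match(pattern, term, p + 1, t + 1);
  return false;
}

// A wildcard query hits only if some single indexed term matches.
bool any_term_matches(const std::u32string &pattern,
                      const std::vector<std::u32string> &terms)
{
  for (const std::u32string &term : terms)
    if (wildcard_match(pattern, term))
      return true;
  return false;
}
```

If the index holds the separate terms федера and ция, the pattern федер*ция
matches neither of them, while фед* matches федера. The quoted phrase
"федера ция" still hits because a phrase query only asks for the two terms
to be adjacent.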
>
>
>
>
>
> Thanks.
>
>
>
> --
>
> Dömők Tamás
>
> _______________________________________________
> CLucene-developers mailing list
> CLucene-developers@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/clucene-developers


-- 
Dömők Tamás
