D11552: [WIP] Handle CJK characters

Michael Heidelbach Thu, 22 Mar 2018 14:39:12 -0700

michaelh added a comment.


  In D11552#231784 <https://phabricator.kde.org/D11552#231784>, @bruns wrote:
  
  > In D11552#231330 <https://phabricator.kde.org/D11552#231330>, @hein wrote:
  >
  > > For the record though - a better way to do this is to use 
QTextBoundaryFinder which will operate e.g. on grapheme cluster boundaries. 
This still isn't super great for Chinese though. If you want to really-properly 
do it you'll end up depending on ICU and using its BreakIterator combined with 
dict-based support for Chinese, which isn't terribly fast however.
  >
  >
  > There are a few implications here:
  >
  > - splitting too much generates terms that are too unspecific, especially in case of 
full text indexing (Think of splitting a western language at character level, 
most texts likely contain almost the full alphabet. Same likely applies to 
Katakana with its roughly 100 graphemes)
  > - term generation at query and index time have to agree about what a term 
is, otherwise a search will likely return nothing. Changing the splitting at a 
later time will require reindexing all affected files
  > - better splitting will cost some more time at index generation, but likely 
makes searching faster (additional time for term generation will be negligible, 
but the search terms are less complex - e.g. "abc" instead of "a" AND "b" AND 
"c").
  
  
  Currently `termgenerator` uses `QTextBoundaryFinder bf(QTextBoundaryFinder::Word, text);`
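
To make the quoted implications concrete, here is a minimal sketch of word-level versus character-level term generation against a toy inverted index. It is not Baloo code: `splitWords`, `splitChars`, `buildIndex` and `search` are hypothetical helpers, and the splitting is simplified to ASCII alphanumerics with the standard library rather than `QTextBoundaryFinder`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only; none of these names exist in Baloo.
using Terms = std::vector<std::string>;
using Index = std::map<std::string, std::set<int>>; // term -> document ids
using Splitter = Terms (*)(const std::string &);

// Word-level splitting: one lower-cased term per run of ASCII alphanumerics.
Terms splitWords(const std::string &text)
{
    Terms terms;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        terms.push_back(current);
    return terms;
}

// Character-level splitting: one lower-cased term per ASCII alphanumeric.
Terms splitChars(const std::string &text)
{
    Terms terms;
    for (unsigned char c : text) {
        if (std::isalnum(c))
            terms.push_back(std::string(1, static_cast<char>(std::tolower(c))));
    }
    return terms;
}

// Index every document with the given splitter.
Index buildIndex(const std::vector<std::string> &docs, Splitter split)
{
    assert(split != nullptr);
    Index index;
    for (std::size_t id = 0; id < docs.size(); ++id) {
        for (const std::string &term : split(docs[id]))
            index[term].insert(static_cast<int>(id));
    }
    return index;
}

// AND query: ids of documents that contain every term of the split query.
std::set<int> search(const Index &index, const std::string &query, Splitter split)
{
    assert(split != nullptr);
    std::set<int> result;
    bool first = true;
    for (const std::string &term : split(query)) {
        const auto it = index.find(term);
        if (it == index.end())
            return {}; // an unknown term means no document can match
        if (first) {
            result = it->second;
            first = false;
            continue;
        }
        std::set<int> kept;
        for (int id : result) {
            if (it->second.count(id))
                kept.insert(id);
        }
        result = kept;
    }
    return result;
}
```

With the two documents "abc def" and "cab bed", a word-level index queried word-level for "abc" matches only the first document. A character-level index queried character-level matches both, since each contains "a", "b" and "c" somewhere. Querying the word-level index with character-level terms matches nothing, which is the index/query disagreement described above.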

REPOSITORY
  R293 Baloo

REVISION DETAIL
  https://phabricator.kde.org/D11552

To: michaelh, hein
Cc: bruns, lbeltrame, #frameworks, alexeymin, cfeck, ashaposhnikov, michaelh, 
astippich, spoorun, nicolasfella, ngraham
