https://bugs.kde.org/show_bug.cgi?id=362647

--- Comment #8 from GuHua <renyune...@gmail.com> ---
Stefan, thanks for the suggestion to write test cases.

Yes, searching at the character level (grapheme, if I understand this word
correctly) is far better than nothing.
As far as I know (as a native Chinese speaker), many (or even most) Chinese
people are happy enough if the software can deal with things at the character
level.
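
To make "character level" concrete, here is a minimal sketch of what such a
search could look like. It is not how baloo works; the function names
(build_char_index, search) and the sample documents are made up for
illustration, and it treats each Unicode code point as one character, which
is a simplification of real grapheme handling.

```python
from collections import defaultdict


def build_char_index(docs):
    """Map each character to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch].add((doc_id, pos))
    return index


def search(index, query):
    """Return ids of documents that contain the query as a consecutive run of characters."""
    if not query:
        return set()
    # Start from every occurrence of the first character, then require each
    # following character at the next position in the same document.
    candidates = set(index.get(query[0], set()))
    for offset, ch in enumerate(query[1:], start=1):
        postings = index.get(ch, set())
        candidates = {(d, p) for (d, p) in candidates if (d, p + offset) in postings}
    return {d for (d, _) in candidates}


# Hypothetical sample documents, not taken from any real test suite.
docs = {1: "教化學生", 2: "我是化學系的學生", 3: "化學生"}
index = build_char_index(docs)

print(sorted(search(index, "學生")))    # [1, 2, 3]
print(sorted(search(index, "化學生")))  # [1, 3]
```

No word boundaries are needed here: a query matches wherever its characters
appear in sequence, so the result may include extra hits but should not miss
a document that really contains the text.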


Actually, using a dictionary is still not enough for Chinese -- it often
happens that a run of three (or more) characters can be split in two different
ways, and both make sense without context.
A simple example of this scenario could be "化學生": both "化學" (chemistry) and
"學生" (student) make sense (moreover, sometimes "化學生" also makes sense, meaning
"a student whose major is in chemistry"), so context is the only way we can
tell how to correctly split them (e.g. "教化學生" will most likely be split into
"教化" [enlighten/teach] and "學生" [student]).
Correct handling of Chinese words requires more sophisticated Natural
Language Processing techniques (e.g. using machine learning), and I think that
would be far beyond today's baloo (or maybe even any search / index engine).
(I studied machine learning and natural language processing during my
master's, so it should be safe for me to say that today's NLP techniques [for
Chinese word-splitting] are not yet good enough to be used in production
[compared with character-level splitting and judged from a user's point of
view, i.e. a false positive is better than a false negative].)
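
The "化學生" ambiguity above can be reproduced with a small sketch that lists
every way a string can be split into dictionary words. The dictionary here is
a toy one I made up for this example, and the function name (segmentations)
is only illustrative; a real segmenter would use a far larger dictionary and
would still face the same ambiguity.

```python
def segmentations(text, dictionary):
    """Return every way to split text into consecutive dictionary words."""
    if not text:
        return [[]]  # one way to split the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in dictionary:
            # Keep this word and split the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([word] + rest)
    return results


# Toy dictionary (an assumption for illustration): single characters plus
# the two-character words mentioned in the example.
toy_dictionary = {"教", "化", "學", "生", "教化", "化學", "學生"}

for split in segmentations("化學生", toy_dictionary):
    print("/".join(split))
for split in segmentations("教化學生", toy_dictionary):
    print("/".join(split))
```

Both "化學/生" and "化/學生" come out as valid splits of "化學生", and "教化學生"
yields "教化/學生" alongside "教/化學/生" and others. The dictionary alone cannot
say which split is right; that is where context would be needed.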

Classical Chinese (this is a style of composing sentences and of
understanding characters / words, not a distinction like the one between
"Traditional Chinese" and "Simplified Chinese") makes the situation more
difficult. Almost all historical texts (e.g. historical records / books /
poems), and there are quite a LOT of them, are written in Classical Chinese,
and nowadays Chinese people still study Classical Chinese and read those texts
(though we usually don't write in Classical Chinese). Even humans may need
some effort to read a piece of text written in Classical Chinese (but
Classical Chinese is very, very concise; that's one of the reasons it exists).
However, in Classical Chinese, characters "are" words in many cases. Splitting
by characters is a very good choice.
