I am interested in this thread.
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro wrote:
Unicode has two ways to represent characters with accents: either with
composed codepoints like "é" or decomposed codepoints where you say
"e" and then "´". The field "00E2 0301" is the decomposed form of
that character above. Our job here is to identify the basic letter
that each composed character contains, by analysing the decomposed
field that you see in that line. I failed to realise that characters
with TWO accents are described as a composed character with ONE accent
plus another accent.
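The single-step decomposition field can be inspected with Python's standard unicodedata module. A quick sketch, using U+1EA5 ("a" with circumflex and acute) as an example of the kind of character being discussed:

```python
import unicodedata

# U+1EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ch = "\u1ea5"

# The UnicodeData.txt decomposition field gives only ONE step:
# a composed character (U+00E2, "a" with circumflex) plus a
# combining acute accent (U+0301) -- not the plain letter "a".
print(unicodedata.decomposition(ch))        # "00E2 0301"

# U+00E2 itself decomposes further into the plain letter plus a mark.
print(unicodedata.decomposition("\u00e2"))  # "0061 0302"
```

This is exactly the situation described above: the first element of the decomposition is itself a composed character, so one more decomposition step is needed to reach the base letter.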
Doesn't that depend on the NF operation you are working with? With a
canonical decomposition, it seems to me that a character with two
accents can just as well be decomposed into one base character and two
combining accent characters (NFKC does a canonical decomposition in
one of its steps).
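For what it's worth, a full canonical decomposition (NFD) does yield one base letter followed by two combining marks for such a character. A quick check with Python's unicodedata, again using U+1EA5:

```python
import unicodedata

ch = "\u1ea5"  # "a" with circumflex and acute

# Full (recursive) canonical decomposition: plain "a" plus two marks.
nfd = unicodedata.normalize("NFD", ch)
print([hex(ord(c)) for c in nfd])               # ['0x61', '0x302', '0x301']

# Canonical combining classes: the base letter has class 0, the marks do not.
print([unicodedata.combining(c) for c in nfd])  # [0, 230, 230]
```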
You don't have to worry about decoding that line; it's all done in
the Python script. The problem is just in the function
is_letter_with_marks(). Instead of only checking whether combining_ids
is a plain letter, it looks like it should also check whether
combining_ids is itself a letter with marks. get_plain_letter would
also need to be able to recurse to extract the "a".
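A minimal sketch of the recursive fix described above. This is not the script's actual code: it stands in unicodedata for the script's parsed UnicodeData.txt table, and the helper names are modeled on the functions mentioned in the thread:

```python
import unicodedata

def is_plain_letter(cp):
    """A letter with no decomposition at all (e.g. 'a')."""
    ch = chr(cp)
    return (unicodedata.category(ch).startswith("L")
            and not unicodedata.decomposition(ch))

def is_mark(cp):
    """Combining marks have a nonzero canonical combining class."""
    return unicodedata.combining(chr(cp)) != 0

def is_letter_with_marks(cp):
    """True if cp decomposes into a base letter followed only by marks.
    The base may itself be a letter with marks, hence the recursion."""
    decomp = unicodedata.decomposition(chr(cp)).split()
    # Ignore compatibility decompositions (tagged like "<compat>").
    if len(decomp) < 2 or decomp[0].startswith("<"):
        return False
    first = int(decomp[0], 16)
    if not (is_plain_letter(first) or is_letter_with_marks(first)):
        return False
    return all(is_mark(int(c, 16)) for c in decomp[1:])

def get_plain_letter(cp):
    """Recurse through canonical decompositions to reach the base letter."""
    if is_plain_letter(cp):
        return cp
    decomp = unicodedata.decomposition(chr(cp)).split()
    return get_plain_letter(int(decomp[0], 16))
```

With this, U+1EA5 (whose decomposition starts with the composed U+00E2 rather than a plain letter) is accepted, and get_plain_letter(0x1EA5) recurses down to "a".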
Thanks for the report and for the lecture about Unicode.
I attached a patch following Thomas's instructions. Could you confirm it?
Actually, with the recent work that has been done on
unicode_norm_table.h, which transposed UnicodeData.txt into
user-friendly tables, shouldn't the Python script in unaccent/ be
replaced by something that works on this table? This would do a
canonical decomposition but keep only the characters with a combining
class of 0. That way we have basic APIs able to look at
UnicodeData.txt and let the caller do the decision-making with the
returned result.
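The approach described above, recursively decompose and keep only the characters whose canonical combining class is 0, can be sketched with unicodedata (the real implementation would work from the tables behind unicode_norm_table.h instead):

```python
import unicodedata

def strip_marks(text):
    # Recursive canonical decomposition via NFD, then keep only the
    # characters with canonical combining class 0 (the base letters).
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.combining(c) == 0)

# Handles stacked accents without any special-casing:
print(strip_marks("\u1ea5"))  # "a"
```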
Thanks, I will learn about it.
Dang Minh Huong