I am interested in this thread.
On Fri, May 26, 2017 at 5:48 PM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:

Unicode has two ways to represent characters with accents: either with composed codepoints like "é", or decomposed codepoints where you say "e" and then a combining accent. The field "00E2 0301" is the decomposed form of the character above ("ấ", U+1EA5, which decomposes into "â" plus COMBINING ACUTE ACCENT). Our job here is to identify the basic letter that each composed character contains, by analysing the decomposition field you see in that line of UnicodeData.txt. I failed to realise that a character with TWO accents is described as a composed character with ONE accent, plus another accent.
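To make the two representations concrete, here is an illustrative sketch using Python's `unicodedata` module, which exposes both the single-step decomposition field from UnicodeData.txt and full NFD normalization (the unaccent generator parses UnicodeData.txt directly rather than using this module):

```python
import unicodedata

# U+1EA5 "ấ" is recorded in UnicodeData.txt with the single-step
# canonical decomposition "00E2 0301": "â" plus COMBINING ACUTE ACCENT.
print(unicodedata.decomposition("\u1ea5"))  # "00E2 0301"

# Full canonical decomposition (NFD) recurses all the way down,
# yielding "a" + COMBINING CIRCUMFLEX + COMBINING ACUTE.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u1ea5")])

# A single accent behaves the same way: composed "é" becomes
# decomposed "e" + U+0301.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
```

Note the difference: the decomposition field stores only one step ("â" + accent), which is exactly why a script that assumes the first element is a plain letter breaks on characters with two accents.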
Doesn't that depend on the normalization form you are working with? With a canonical decomposition, it seems to me that a character with two accents can just as well be decomposed into one base character and two combining accent characters (NFKC performs a canonical decomposition in one of its steps).

You don't have to worry about decoding that line; it's all done in that Python script. The problem is just in the function is_letter_with_marks(): instead of only checking whether combining_ids[0] is a plain letter, it should also check whether combining_ids[0] is itself a letter with marks. get_plain_letter() would also need to be able to recurse to extract the "a".
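The fix Thomas describes can be sketched as follows. This is a simplified stand-in for the helpers in the unaccent generator script: the function names mirror the script, but the bodies here are illustrative assumptions that use Python's `unicodedata` module instead of the script's own UnicodeData.txt parsing.

```python
import unicodedata

def get_decomposition(cp):
    # Single-step canonical decomposition from UnicodeData.txt,
    # as a list of codepoints (empty if the character is not composed).
    d = unicodedata.decomposition(chr(cp))
    if not d or d.startswith("<"):  # ignore compatibility decompositions
        return []
    return [int(x, 16) for x in d.split()]

def is_mark(cp):
    # Combining marks have a nonzero canonical combining class.
    return unicodedata.combining(chr(cp)) > 0

def is_plain_letter(cp):
    # A letter with no canonical decomposition of its own.
    return chr(cp).isalpha() and not get_decomposition(cp)

def is_letter_with_marks(cp):
    # The suggested fix: combining_ids[0] may itself be a letter with
    # marks (e.g. U+1EA5 decomposes to U+00E2 + U+0301), so recurse
    # instead of requiring a plain letter in the first position.
    ids = get_decomposition(cp)
    return (len(ids) >= 2
            and (is_plain_letter(ids[0]) or is_letter_with_marks(ids[0]))
            and all(is_mark(c) for c in ids[1:]))

def get_plain_letter(cp):
    # Recurse through one-step decompositions until we reach the
    # undecorated base letter.
    if is_plain_letter(cp):
        return cp
    return get_plain_letter(get_decomposition(cp)[0])
```

With this, get_plain_letter(0x1EA5) follows "ấ" → "â" → "a", and is_letter_with_marks() accepts characters whose first decomposition element is itself accented.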
Thanks for the report and the lecture about Unicode. I have attached a patch following Thomas's instructions. Could you confirm it?
Actually, with the recent work done on unicode_norm_table.h, which transposes UnicodeData.txt into user-friendly tables, shouldn't the Python script in unaccent/ be replaced by something that works on this table? That would do a canonical decomposition, but keep only the characters with a combining class of 0. This way we have basic APIs able to look at UnicodeData.txt, and let the caller do the decision-making with the result returned. -- Michael
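The idea of "canonical decomposition, keeping only the class-0 characters" can be illustrated in a few lines. This is a sketch of the concept using Python's `unicodedata` module, not the C tables in unicode_norm_table.h that Michael is referring to:

```python
import unicodedata

def strip_marks(s):
    # Canonically decompose, then keep only the characters whose
    # canonical combining class is 0, i.e. drop the combining marks.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.combining(c) == 0)

print(strip_marks("\u1ea5"))  # "a"
print(strip_marks("Hôtel"))   # "Hotel"
```

This handles characters with any number of accents in one pass, because NFD already recurses through the single-step decompositions for us.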
Thanks, I will learn about it.
-- Dang Minh Huong