On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <nlh...@gmail.com> wrote:
> Could you explain to me what this line means:
> “
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
> “
>
> If you could give me an example of adding a rule for “recursive” case, I can
> do the rest. I am not familiar with this unaccent format generation yet.

So contrib/unaccent/generate_unaccent_rules.py is a Python script that takes UnicodeData.txt, a list of information about all Unicode codepoints available at a URL that is shown in a comment, and generates unaccent.rules. The idea was to avoid having to change it manually every time someone finds characters that should be in there (as you have just done!) by doing it systematically.

Unicode has two ways to represent characters with accents: either with composed codepoints like "é" or decomposed codepoints where you say "e" and then "´". The field "00E2 0301" is the decomposed form of that character above. Our job here is to identify the basic letter that each composed character contains, by analysing the decomposed field that you see in that line. I failed to realise that characters with TWO accents are described as a composed character with ONE accent plus another accent.

You don't have to worry about decoding that line, it's all done in that Python script. The problem is just in the function is_letter_with_marks(). Instead of just checking if combining_ids[0] is a plain letter, it looks like it should also check if combining_ids[0] itself is a letter with marks. Also get_plain_letter would need to be able to recurse to extract the "a".

I hope that helps!

-- 
Thomas Munro
http://www.enterprisedb.com
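
For illustration, here is a minimal sketch of the recursive check described above. It is not the actual contents of generate_unaccent_rules.py. The Codepoint record, parse_line(), is_plain_letter(), is_mark() and the small sample table are assumptions made up for this example; only the names is_letter_with_marks, get_plain_letter and combining_ids come from the discussion. The sample entries other than 1EA5 are trimmed after the decomposition field, since that is the last field the sketch reads.

```python
from collections import namedtuple

# Hypothetical record type; the real script may store these fields differently.
Codepoint = namedtuple("Codepoint", ["id", "general_category", "combining_ids"])

# Sample entries in UnicodeData.txt format. Only 1EA5 is a full line; the
# others are trimmed after field 5 (the decomposition field).
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;
00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
"""


def parse_line(line):
    """Turn one semicolon-separated line into a Codepoint (assumed helper)."""
    fields = line.split(";")
    # Field 5 is the decomposition; skip tags such as "<compat>" if present.
    ids = [int(s, 16) for s in fields[5].split() if not s.startswith("<")]
    return Codepoint(int(fields[0], 16), fields[2], ids)


def is_plain_letter(codepoint):
    """True for ASCII a-z and A-Z (assumed definition of 'plain')."""
    return (ord("a") <= codepoint.id <= ord("z")
            or ord("A") <= codepoint.id <= ord("Z"))


def is_mark(codepoint):
    """True for combining marks (general categories Mn, Me, Mc)."""
    return codepoint.general_category in ("Mn", "Me", "Mc")


def is_letter_with_marks(codepoint, table):
    """True if codepoint decomposes to a letter followed by marks, where the
    letter may itself be a letter with marks (the recursive case)."""
    ids = codepoint.combining_ids
    if len(ids) < 2:
        return False
    # Everything after the first element must be a known combining mark.
    if not all(i in table and is_mark(table[i]) for i in ids[1:]):
        return False
    base = table.get(ids[0])
    if base is None:
        return False
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(codepoint, table):
    """Follow combining_ids[0] until a plain letter is reached.
    Assumes is_letter_with_marks(codepoint, table) is true."""
    base = table[codepoint.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)


table = {}
for line in SAMPLE.splitlines():
    entry = parse_line(line)
    table[entry.id] = entry
```

With this table, is_letter_with_marks(table[0x1EA5], table) is true because 00E2 is itself a letter with marks, and chr(get_plain_letter(table[0x1EA5], table).id) gives "a".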