Could you explain to me what this line means?

  1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4

If you could give me an example of adding a rule for the "recursive" case, I can do the rest. I am not yet familiar with how the unaccent rules are generated.

Thanks,
Kha

> On 26 May 2017, at 21.19, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
>
> On Sat, May 27, 2017 at 5:13 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> I wrote:
>>> Nguyen Le Hoang Kha <nlh...@gmail.com> writes:
>>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>>> character. These unaccent rules are added to handle such cases (which are
>>>> very common).
>>
>>> I can't see any reason not to add these --- any objections out there?
>>
>> Oh, wait a minute.  Patching unaccent.rules directly isn't the way
>> to do this; that file is supposed to be generated by
>> generate_unaccent_rules.py.  Can you see how to modify that script
>> to produce these rules?
>
> Looking at one example from this patch:
>
> UTF8: <E1><BA><A5>
> Codepoint: 1EA5
> Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
>
> In UnicodeData.txt it's this line:
>
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
>
> The problem is that generate_unaccent_rules.py assumes that the
> composing data is a plain letter followed by some number of
> diacritical modifiers.  That's true for the characters with a single
> accent, but in this multi-accent case it's *composed* character 00E2
> (LATIN SMALL LETTER A WITH CIRCUMFLEX) and a diacritical marker 0301
> (COMBINING ACUTE ACCENT).  So we need to teach it to be recursive.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
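
For reference, here is a rough sketch of what "recursive" could mean for the line discussed above. It is not the code in generate_unaccent_rules.py; the names (SAMPLE, parse, base_letter, make_rules) are made up for illustration, and the tab-separated "accented, plain" output is an assumption about the rules format. A UnicodeData.txt line is a list of semicolon-separated fields: field 0 is the codepoint, field 2 the general category (Ll = lowercase letter, Mn = non-spacing mark), and field 5 the decomposition. For 1EA5 the decomposition is "00E2 0301", and 00E2 itself decomposes to "0061 0302", so the base letter has to be found by following the first codepoint until it no longer decomposes.

```python
# Sample lines in UnicodeData.txt format (only the entries needed here).
SAMPLE = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E2;LATIN SMALL LETTER A WITH CIRCUMFLEX;Ll;0;L;0061 0302;;;;N;LATIN SMALL LETTER A CIRCUMFLEX;;00C2;;00C2
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0302;COMBINING CIRCUMFLEX ACCENT;Mn;230;NSM;;;;;N;NON-SPACING CIRCUMFLEX;;;;
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
"""


def parse(text):
    """Map codepoint -> (general category, list of decomposition codepoints)."""
    table = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 6:
            continue
        decomposition = fields[5]
        # Entries starting with "<tag>" are compatibility mappings; this
        # sketch ignores them and treats the character as not decomposable.
        if decomposition.startswith("<"):
            parts = []
        else:
            parts = [int(cp, 16) for cp in decomposition.split()]
        table[int(fields[0], 16)] = (fields[2], parts)
    return table


def base_letter(table, codepoint):
    """Return the plain letter under any number of accents, or None."""
    if codepoint not in table:
        return None
    category, parts = table[codepoint]
    if not parts:
        # No decomposition: it is a base letter only if it is a letter at all.
        return codepoint if category.startswith("L") else None
    first, marks = parts[0], parts[1:]
    # Everything after the first codepoint must be a combining mark (M*).
    if not marks or not all(
        m in table and table[m][0].startswith("M") for m in marks
    ):
        return None
    # The recursive step: the first codepoint may itself be a composed letter.
    return base_letter(table, first)


def make_rules(table):
    """One 'accented<TAB>plain' line per composed letter, in codepoint order."""
    rules = []
    for codepoint in sorted(table):
        if not table[codepoint][1]:
            continue  # nothing to strip from an undecomposed character
        plain = base_letter(table, codepoint)
        if plain is not None:
            rules.append(chr(codepoint) + "\t" + chr(plain))
    return rules


TABLE = parse(SAMPLE)
RULES = make_rules(TABLE)
for rule in RULES:
    print(rule)
```

A non-recursive check would reject 1EA5 because 00E2 is not a plain letter; with the recursive call, 1EA5 resolves through 00E2 to 0061, so both composed characters in the sample get a rule mapping them to "a".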