Re: [HACKERS] Extra Vietnamese unaccent rules

Thomas Munro Fri, 26 May 2017 14:49:52 -0700

On Sat, May 27, 2017 at 9:09 AM, Kha Nguyen <[email protected]> wrote:
> Could you explain to me what this line means:
> “
> 1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
> “
>
> If you could give me an example of adding a rule for “recursive” case, I can 
> do the rest. I am not familiar with this unaccent format generation yet.


contrib/unaccent/generate_unaccent_rules.py is a Python script that
reads UnicodeData.txt (a list of information about all Unicode
codepoints, available at a URL that is shown in a comment) and
generates unaccent.rules.  The idea was to produce the rules
systematically, so that nobody has to edit unaccent.rules by hand
every time someone finds characters that should be in there (as you
have just done!).

Unicode has two ways to represent characters with accents: either as a
single composed codepoint like "é", or as a decomposed sequence where
you say "e" and then a combining accent (here U+0301, the combining
acute).  The field "00E2 0301" in the line above is the decomposed
form of that character: U+00E2 is "â" and U+0301 is the combining
acute accent.  Our job here is to identify the basic letter that each
composed character contains, by analysing the decomposition field that
you see in that line.  I failed to realise that characters with TWO
accents are described as a composed character with ONE accent plus
another accent, so the first element of the decomposition is not
always a plain letter.
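
To make the layout of that line concrete, here is a small standalone
sketch (not taken from generate_unaccent_rules.py) that splits it on
";" and looks at the decomposition field.  The variable names are my
own; the only library used is Python's standard unicodedata module.

```python
import unicodedata

# The UnicodeData.txt entry quoted above, as a single line.
line = ("1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;"
        "00E2 0301;;;;N;;;1EA4;;1EA4")

fields = line.split(";")
codepoint = int(fields[0], 16)        # field 0: the codepoint itself
name = fields[1]                      # field 1: character name
general_category = fields[2]          # field 2: "Ll" = lowercase letter
# Field 5: the decomposition, a space-separated list of hex codepoints.
decomposition = [int(c, 16) for c in fields[5].split()]

for cp in decomposition:
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))

# The first element is itself a composed character with its own
# decomposition, which is the "recursive" case.
inner = unicodedata.decomposition(chr(decomposition[0]))
print("U+00E2 decomposes to:", inner)
```

So the chain is U+1EA5 -> U+00E2 + U+0301, and U+00E2 -> U+0061 ("a")
+ U+0302 (combining circumflex).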

You don't have to worry about decoding that line; it's all done in
that Python script.  The problem is just in the function
is_letter_with_marks().  Instead of only checking whether
combining_ids[0] is a plain letter, it looks like it should also
accept the case where combining_ids[0] is itself a letter with marks.
get_plain_letter() would also need to be able to recurse to extract
the "a".
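
Roughly what I have in mind, as an untested-against-the-real-script
sketch.  The function names is_letter_with_marks and get_plain_letter
and the combining_ids attribute are the ones mentioned above;
everything else (the Codepoint record, the table dictionary, the
extra table argument, and the definition of "plain letter" as an
ASCII letter) is an assumption made so that the example runs on its
own.

```python
from collections import namedtuple

# Hypothetical stand-in for the script's per-codepoint record.
Codepoint = namedtuple("Codepoint", "id general_category combining_ids")

def is_plain_letter(cp):
    # Assumption: "plain" means an unaccented ASCII letter.
    return ord("a") <= cp.id <= ord("z") or ord("A") <= cp.id <= ord("Z")

def is_mark(cp):
    # Mn, Me and Mc are the Unicode general categories for marks.
    return cp.general_category in ("Mn", "Me", "Mc")

def is_letter_with_marks(cp, table):
    """True if cp decomposes to a base followed only by marks, where the
    base is a plain letter or, recursively, a letter with marks."""
    if len(cp.combining_ids) < 2:
        return False
    base = table.get(cp.combining_ids[0])
    marks = [table.get(i) for i in cp.combining_ids[1:]]
    if base is None or any(m is None or not is_mark(m) for m in marks):
        return False
    return is_plain_letter(base) or is_letter_with_marks(base, table)

def get_plain_letter(cp, table):
    """Follow combining_ids[0] until a plain letter is reached.
    Only call this when is_letter_with_marks(cp, table) is true."""
    base = table[cp.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)

# Just enough data to cover the character from the quoted line.
table = {
    0x0061: Codepoint(0x0061, "Ll", []),                # a
    0x0301: Codepoint(0x0301, "Mn", []),                # combining acute
    0x0302: Codepoint(0x0302, "Mn", []),                # combining circumflex
    0x00E2: Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a + circumflex
    0x1EA5: Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # U+00E2 + acute
}

target = table[0x1EA5]
if is_letter_with_marks(target, table):
    print("U+%04X -> %s" % (target.id, chr(get_plain_letter(target, table).id)))
```

With only the non-recursive check, U+1EA5 would be rejected because
its first element is U+00E2 rather than a plain letter; the recursive
check accepts it and get_plain_letter walks down to "a".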

I hope that helps!

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
