Re: [HACKERS] Extra Vietnamese unaccent rules

Dang Minh Huong Wed, 05 Jul 2017 08:46:37 -0700

On 2017/07/05 15:28, Michael Paquier wrote:

I have finally been able to look at this patch.


Thanks for reviewing and the new version of the patch.

(Surprised to see that generate_unaccent_rules.py is inconsistent on
MacOS, runs fine on Linux).

  def get_plain_letter(codepoint, table):
      """Return the base codepoint without marks."""
      if is_letter_with_marks(codepoint, table):
-        return table[codepoint.combining_ids[0]]
+        if len(table[codepoint.combining_ids[0]].combining_ids) > 1:
+            # Recursive to find the plain letter
+            return get_plain_letter(table[codepoint.combining_ids[0]],table)
+        elif is_plain_letter(table[codepoint.combining_ids[0]]):
+            return table[codepoint.combining_ids[0]]
+        else:
+            return None
      elif is_plain_letter(codepoint):
          return codepoint
      else:
-        raise "mu"
+        return None
The code paths returning None should not be reached, so I would
suggest adding an assertion instead. Callers of get_plain_letter would
blow up on None, still that would make future debugging harder.

  def is_letter_with_marks(codepoint, table):
-    """Returns true for plain letters combined with one or more marks."""
+    """Returns true for letters combined with one or more marks."""
      # See 
http://www.unicode.org/reports/tr44/tr44-14.html#General_Category_Values
      return len(codepoint.combining_ids) > 1 and \
-           is_plain_letter(table[codepoint.combining_ids[0]]) and \
+           (is_plain_letter(table[codepoint.combining_ids[0]]) or\
+            is_letter_with_marks(table[codepoint.combining_ids[0]],table))
and \
             all(is_mark(table[i]) for i in codepoint.combining_ids[1:]
This was already hard to follow, and this patch makes its harder. I
think that the thing should be refactored with multiple conditions.

              if is_letter_with_marks(codepoint, table):
-                charactersSet.add((codepoint.id,
+                if get_plain_letter(codepoint, table) <> None:
+                    charactersSet.add((codepoint.id,
This change is not necessary as a letter with marks is not a plain
character anyway.

Testing with characters having two accents, the results are produced
as wanted. I am attaching an updated patch with all those
simplifications. Thoughts?


Thanks, so pretty. The patch is fine to me.

---
Thanks and best regards,
Dang Minh Huong

---
このEメールはアバスト アンチウイルスによりウイルススキャンされています。
https://www.avast.com/antivirus



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Extra Vietnamese unaccent rules

Reply via email to