On 28/01/2021 01:23, John Naylor wrote:
> Hi Heikki,
>
> 0001 through 0003 are straightforward, and I think they can be committed now if you like.
>
> 0004 is also pretty straightforward. The check you proposed upthread for pg_upgrade seems like the best solution to make that workable. I'll take a look at 0005 soon.
>
> I measured the conversions that were rewritten in 0003, and there is indeed a noticeable speedup:
>
> Big5 to EUC-TW:
>
> head    196ms
> 0001-3  152ms
>
> EUC-TW to Big5:
>
> head    190ms
> 0001-3  144ms
>
> I've attached the driver function for reference. Example use:
>
> select drive_conversion(
>    1000, 'euc_tw'::name, 'big5'::name,
>    convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
> );

Thanks! I have committed patches 0001 and 0003 in this series, with minor comment fixes. Next I'm going to write the pg_upgrade check for patch 0004, to get that into a committable state too.

> I took a look at the test suite also, and the only thing to note is a couple of places where the comment doesn't match the code:
>
> +  -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
> +  byte1 = hex('0e');
> +  for byte2 in hex('a1')..hex('df') loop
> +    return next b(byte1, byte2);
> +  end loop;
> +
> +  -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
> +  byte1 = hex('0f');
> +  for byte2 in hex('a1')..hex('fe') loop
> +    for byte3 in hex('a1')..hex('fe') loop
> +      return next b(byte1, byte2, byte3);
> +    end loop;
> +  end loop;
>
> Not sure if it matters, but thought I'd mention it anyway.

Good catch! The comments were correct and the tests were wrong: they did not test those 2- and 3-byte encoded characters as intended. It doesn't matter for testing this patch; I only included those euc_jis_2004 tests for the sake of completeness. But if someone finds this test suite in the archives and wants to use it for something real, make sure you fix that first.
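
For anyone picking this up later, here is a minimal sketch of the byte sequences the comments describe, assuming the intended lead bytes are 0x8e (SS2) and 0x8f (SS3) as the comments say. It is written in Python rather than with the test suite's hex() and b() helpers, since those are not shown in this thread; the function names below are hypothetical and only illustrate the ranges.

```python
# Sketch only: enumerate the sequences described by the test comments.
# SS2/SS3 values are taken from the comments (0x8e, 0x8f), not from the
# code, which used 0x0e and 0x0f by mistake.
SS2 = 0x8E  # lead byte for the 2-byte JIS X 0201 sequences
SS3 = 0x8F  # lead byte for the 3-byte sequences


def ss2_sequences():
    """Yield every 2-byte sequence SS2 + 0xa1..0xdf."""
    for byte2 in range(0xA1, 0xDF + 1):
        yield bytes([SS2, byte2])


def ss3_sequences():
    """Yield every 3-byte sequence SS3 + 0xa1..0xfe + 0xa1..0xfe."""
    for byte2 in range(0xA1, 0xFE + 1):
        for byte3 in range(0xA1, 0xFE + 1):
            yield bytes([SS3, byte2, byte3])


ss2 = list(ss2_sequences())
ss3 = list(ss3_sequences())
print(len(ss2), ss2[0].hex(), ss2[-1].hex())  # 63 8ea1 8edf
print(len(ss3), ss3[0].hex(), ss3[-1].hex())  # 8836 8fa1a1 8ffefe
```

In the PL/pgSQL test itself, the equivalent change would presumably be just the two byte1 assignments.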

- Heikki

