Re: [HACKERS] Radix tree for character conversion

Michael Paquier Wed, 25 Jan 2017 23:29:19 -0800

On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI
<[email protected]> wrote:
> [...patch...]


Nobody has showed up yet to review this patch, so I am giving it a shot.

The patch file sizes are scary at first sight, but after having a look:
 36 files changed, 1411 insertions(+), 54398 deletions(-)
Yes that's a surprise, something like git diff --irreversible-delete
would have helped as most of the diffs are just caused by 3 files
being deleted in patch 0004, making 50k lines going to the abyss of
deletion.

> Hello, I found a bug in my portion while rebasing.

Right, that's 0001. Nice catch.

> The attached files are the following. This patchset is not
> complete missing changes of map files. The change is tremendously
> large but generatable.
>
> 0001-Add-missing-semicolon.patch
>
>   UCS_to_EUC_JP.pl has a line missing teminating semicolon. This
>   doesn't harm but surely a syntax error. This patch fixes it.
>   This might should be a separate patch.

This requires a back-patch. This makes me wonder how long this script
has actually not run...

> 0002-Correct-reference-resolution-syntax.patch
>
>   convutils.pm has lines with different syntax of reference
>   resolution. This unifies the syntax.

Yes that looks right to me. I am the best perl guru on this list but
looking around $$var{foo} is bad, ${$var}{foo} is better, and
$var->{foo} is even better. This also generates no diffs when running
make in src/backend/utils/mb/Unicode/. So no objections to that.

> 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch
>
>   Before adding radix tree stuff, applied pgperltidy and inserted
>   format-skipping pragma for the parts where perltidy seems to do
>   too much.

Which version of perltidy did you use? Looking at the archives, the
perl code is cleaned up with a specific version, v20090616. See
https://www.postgresql.org/message-id/[email protected]
for example on the matter. As perltidy changes over time, this may be
a sensitive change if done this way.

> 0004-Use-radix-tree-for-character-conversion.patch
>
>   Radix tree body.

Well, here a lot of diffs could have been saved.

> The unattached fifth patch is generated by the following steps.
>
> [$(TOP)]$ ./configure
> [Unicode]$ make
> [Unicode]$ make distclean
> [Unicode]$ git add .
> [Unicode]$ commit
> === COMMITE MESSSAGE
> Replace map files with radix tree files.
>
> These encodings no longer uses the former map files and uses new radix
> tree files.
> ===

OK, I can see that working, with 200k of maps generated.. So going
through the important bits of this jungle..

+/*
+ * radix tree conversion function - this should be identical to the function in
+ * ../conv.c with the same name
+ */
+static inline uint32
+pg_mb_radix_conv(const pg_mb_radix_tree *rt,
+                int l,
+                unsigned char b1,
+                unsigned char b2,
+                unsigned char b3,
+                unsigned char b4)
This is not nice. Having a duplication like that is a recipe to forget
about it as this patch introduces a dependency with conv.c and the
radix tree generation.

Having a .gitignore in Unicode/ would be nice, particularly to avoid
committing map_checker.

A README documenting things may be welcome, or at least comments at
the top of map_checker.c. Why is map_checker essential? What does it
do? There is no way to understand that easily, except that it includes
a "radix tree conversion function", and that it performs sanity checks
on the radix trees to be sure that they are on a good shape. But as
this something that one would guess only after looking at your patch
and the code (at least I will sleep less stupid tonight after reading
this stuff).

--- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl
 # Drop these SJIS codes from the source for UTF8=>SJIS conversion
 #<<< do not let perltidy touch this
-my @reject_sjis =(
+my @reject_sjis = (
    0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782,
-   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
+   0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797,
    0x879a..0x879c
-);
+   );
This is not generated, it would be nice to drop the noise from the patch.

Here is another one:
-       $i->{code} = $jis | (
-           $jis < 0x100
-           ? 0x8e00
-           : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
-
+#<<< do not let perltidy touch this
+       $i->{code} = $jis | ($jis < 0x100 ? 0x8e00:
+                            ($sjis >= 0xeffd ? 0x8f8080 : 0x8080));
+#>>>

        if (l == 2)
        {
-           iutf = *utf++ << 8;
-           iutf |= *utf++;
+           b3 = *utf++;
+           b4 = *utf++;
        }
Ah, OK. This conversion is important so as it performs a minimum of
bitwise operations. Yes let's keep that. That's pretty cool to get a
faster operation.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Radix tree for character conversion

Reply via email to