On Thu, Jun 19, 2025 at 10:41:57AM -0700, Jeff Davis wrote:
> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
> 
> * reducing the need for palloc and extra buffers, perhaps by using
> buffers on the stack for small strings
> 
> * operate more directly on UTF-8 data rather than decoding and re-
> encoding the entire string

Not sure how relevant it is here, but in ZFS in the 00s we implemented
form-insensitive, form-preserving functionality with an interesting
optimization that greatly reduced allocations and had a decent fast
path for ASCII-mostly strings.  It works by doing normalization only in
a) string comparison functions and b) string hashing functions (because
ZFS directories are hash tables).

For b-trees one has to normalize the whole string, so perhaps this
scheme is just not that relevant to databases that use b-trees, but then
depending on how the b-tree key is smeared over internal nodes it might
be.

The ASCII-mostly fast path works like this: when comparing strings you
keep a "cursor" into each string, and whenever the current and next
bytes are both ASCII you compare the current byte the fast way (as
plain ASCII); you only take the slow path of checking whether the
current _character_ (as opposed to _byte_) requires normalization when
either the current or next byte is not ASCII.  In the slow path you
normalize only the _current character_, so you only need enough buffer
space for that.  In principle a Unicode character could consist of an
unbounded number of codepoints (zalgo), I think, but in practice it
will be no more than half a dozen or so, so the buffer can be small and
stack-allocated.  The same concept applies to string hashing: you walk
a cursor over the string, hashing ASCII bytes directly and normalizing
non-ASCII characters, one character at a time.
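
To make that concrete, here is a minimal C sketch of the cursor scheme.
Everything in it is my own illustration, not ZFS code: utf8_len(),
is_combining(), and normalize_seq() are hypothetical stand-ins (identity
stubs so it compiles); a real normalize_seq() would emit the NFD/NFC
form of the one character into the stack buffer.  A hash variant would
walk the same cursor and feed the normalized bytes into the hash state
instead of memcmp().

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define SEQ_BUF 32          /* plenty for any realistic combining run */

    /* Length in bytes of the UTF-8 code point starting at p. */
    static size_t
    utf8_len(const unsigned char *p)
    {
        if (*p < 0x80) return 1;
        if ((*p & 0xE0) == 0xC0) return 2;
        if ((*p & 0xF0) == 0xE0) return 3;
        return 4;
    }

    /* Hypothetical: is this code point a combining mark?  A real
     * version would consult Unicode property tables. */
    static bool
    is_combining(const unsigned char *p)
    {
        (void) p;
        return false;
    }

    /* Copy one "character" (base code point plus trailing combining
     * marks) into buf and advance the cursor.  Placeholder: identity
     * copy; a real version would write the normalized form. */
    static size_t
    normalize_seq(const unsigned char **cur, unsigned char *buf)
    {
        size_t n = utf8_len(*cur);

        while ((*cur)[n] != '\0' && is_combining(*cur + n))
        {
            size_t step = utf8_len(*cur + n);

            if (n + step > SEQ_BUF)
                break;          /* truncate a pathological (zalgo) run */
            n += step;
        }
        memcpy(buf, *cur, n);
        *cur += n;
        return n;
    }

    /* Form-insensitive strcmp(): raw byte compare while the current
     * and next bytes are ASCII on both sides, per-character
     * normalization into small stack buffers otherwise. */
    static int
    fi_strcmp(const char *a, const char *b)
    {
        const unsigned char *pa = (const unsigned char *) a;
        const unsigned char *pb = (const unsigned char *) b;

        while (*pa != '\0' && *pb != '\0')
        {
            /* Fast path: no combining mark can follow, so the current
             * bytes compare as plain ASCII. */
            if (*pa < 0x80 && *pb < 0x80 && pa[1] < 0x80 && pb[1] < 0x80)
            {
                if (*pa != *pb)
                    return (int) *pa - (int) *pb;
                pa++;
                pb++;
                continue;
            }

            /* Slow path: normalize just one character from each side. */
            unsigned char na[SEQ_BUF], nb[SEQ_BUF];
            size_t la = normalize_seq(&pa, na);
            size_t lb = normalize_seq(&pb, nb);
            size_t common = la < lb ? la : lb;
            int c = memcmp(na, nb, common);

            if (c != 0)
                return c;
            if (la != lb)
                return la < lb ? -1 : 1;
        }
        return (int) *pa - (int) *pb;
    }

    int
    main(void)
    {
        printf("%d\n", fi_strcmp("hello", "hello"));    /* 0 */
        printf("%d\n", fi_strcmp("abc", "abd"));        /* negative */
        return 0;
    }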

The really nice thing about form-insensitive/form-preserving
functionality is that it preserves the input form rather than
normalizing on create and lookup, and that is what makes the fast path
described above feasible.  If you normalize on create and lookup
instead, you have to heap-allocate enough space for each string's
normalized form.  The other nice thing is that f-i/f-p behavior is a
lot less surprising to users: the input methods they use don't matter.

What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
something very close to NFD) but input methods (even on OS X) invariably
produce NFC (well, something close to it), at least for Latin scripts.
This means that on OS X the filenames you get from directory listings
will be NFD while user input will be NFC, so you can't just
memcmp()/strcmp() them -- you have to normalize _or_ use form-
insensitive string comparison, and nothing did the latter 20 years ago.
Thus doing the form-insensitivity in the filesystem seemed best, and if
you do that you can be form-preserving, which enables the optimization
described above.

My apologies if the above is not relevant to PG and thus noise,

Nico