On Thu, Jun 19, 2025 at 10:41:57AM -0700, Jeff Davis wrote:
> In addition to the lookups themselves, there are other opportunities
> for optimization as well, such as:
>
> * reducing the need for palloc and extra buffers, perhaps by using
>   buffers on the stack for small strings
>
> * operate more directly on UTF-8 data rather than decoding and re-
>   encoding the entire string
Not sure how relevant it is here, but in ZFS in the 00s we implemented
form-insensitive, form-preserving functionality with an interesting
optimization that greatly reduced allocations and which had a decent
fast path for ASCII-mostly strings.

It works by doing normalization only in a) string comparison functions,
b) string hashing functions (because in ZFS, directories are hash
tables).  For b-trees one has to normalize the whole string, so perhaps
this scheme is just not that relevant to databases that use b-trees,
but then depending on how the b-tree key is smeared over internal nodes
it might be.

The ASCII-mostly fast path works like this: when comparing strings you
walk a "cursor" through each string, and whenever the current and next
bytes are both ASCII you compare the current byte the fast way (as
plain ASCII); you only take the slow path of checking whether the
current _character_ (as opposed to _byte_) requires normalization when
either the current or the next byte is not ASCII.  In the slow path you
only normalize the _current character_, so you only need enough buffer
space for that.  In principle a Unicode character could consist of an
unbounded number of codepoints (zalgo), I think, but in practice it
will be no more than half a dozen or so, so the buffer can be small and
stack-allocated.  (A rough sketch of such a comparison loop is appended
at the end of this mail.)

This same concept applies to string hashing: you walk a cursor over the
string, hash each ASCII _character_ directly, and normalize each
non-ASCII character before hashing it, all one character at a time.

The really nice thing about form-insensitive/form-preserving
functionality is that it is form-preserving rather than normalizing on
create and lookup, and that makes the fast path described above
feasible.  Whereas if you normalize on create and lookup you have to
heap-allocate enough space for the normalized form of every string.
The other nice thing is that f-i/f-p behavior is a lot less surprising
to users -- the input methods they use don't matter.

What motivated this f-i/f-p behavior was that HFS+ used NFD (well,
something very close to NFD) but input methods (even on OS X)
invariably produce NFC (well, something close to it), at least for
Latin scripts.  This means that on OS X filenames obtained from
directory listings will be NFD while user inputs will be NFC, so you
can't just memcmp()/strcmp() those -- you have to normalize _or_ use
form-insensitive string comparison, but nothing did that 20 years ago.
Thus doing the form-insensitivity in the filesystem seemed best, and if
you do that you can be form-preserving to enable the optimization
described above.

My apologies if the above is not relevant to PG and thus noise,

Nico
--
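
P.S.: to make the fast path concrete, here is a rough sketch in C.  It
is not the actual ZFS code; cmp_form_insensitive() and
normalize_next_char() are made-up names, and the normalization helper
is stubbed to just copy the character's raw bytes where a real
implementation would emit that one character's normalized (say, NFD)
form.

#include <stddef.h>
#include <string.h>

#define NORM_BUF_MAX 32         /* one character's normalization; in
                                 * practice half a dozen codepoints or
                                 * so suffices */

/*
 * Hypothetical slow-path helper: consume one character at *cur and
 * write its normalized form into out[], returning the length.  This
 * stub only copies the lead byte plus UTF-8 continuation bytes; a real
 * version would also consume following combining marks and apply
 * Unicode normalization to the whole sequence.
 */
static size_t
normalize_next_char(const char **cur, const char *end, char *out)
{
    const char *p = *cur;
    size_t      n = 0;

    do {
        out[n++] = *p++;
    } while (p < end && ((unsigned char) *p & 0xC0) == 0x80 &&
             n < NORM_BUF_MAX);

    *cur = p;
    return n;
}

/* Form-insensitive comparison; returns <0, 0, >0 like memcmp(). */
static int
cmp_form_insensitive(const char *a, size_t alen, const char *b, size_t blen)
{
    const char *ap = a, *aend = a + alen;
    const char *bp = b, *bend = b + blen;

    while (ap < aend && bp < bend)
    {
        /*
         * Fast path: if the current and next bytes on both sides are
         * ASCII, the current byte cannot be followed by a combining
         * mark, so compare it as plain ASCII.
         */
        if ((unsigned char) ap[0] < 0x80 && (unsigned char) bp[0] < 0x80 &&
            (ap + 1 == aend || (unsigned char) ap[1] < 0x80) &&
            (bp + 1 == bend || (unsigned char) bp[1] < 0x80))
        {
            if (*ap != *bp)
                return (unsigned char) *ap - (unsigned char) *bp;
            ap++;
            bp++;
        }
        else
        {
            /* Slow path: normalize just the current character of each
             * side into small stack buffers and compare those. */
            char    na[NORM_BUF_MAX], nb[NORM_BUF_MAX];
            size_t  la = normalize_next_char(&ap, aend, na);
            size_t  lb = normalize_next_char(&bp, bend, nb);
            size_t  l = la < lb ? la : lb;
            int     c = memcmp(na, nb, l);

            if (c != 0)
                return c;
            if (la != lb)
                return la < lb ? -1 : 1;
        }
    }
    return (ap < aend) - (bp < bend);
}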
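
The hashing side is the same idea; again only a sketch, reusing the
hypothetical normalize_next_char() and NORM_BUF_MAX from above, with
64-bit FNV-1a standing in for whatever hash the directory code actually
uses:

#include <stdint.h>

static uint64_t
hash_form_insensitive(const char *s, size_t len)
{
    const char *p = s, *end = s + len;
    uint64_t    h = UINT64_C(14695981039346656037);     /* FNV-1a basis */

    while (p < end)
    {
        if ((unsigned char) p[0] < 0x80 &&
            (p + 1 == end || (unsigned char) p[1] < 0x80))
        {
            /* Fast path: hash the ASCII byte as-is. */
            h = (h ^ (unsigned char) *p++) * UINT64_C(1099511628211);
        }
        else
        {
            /* Slow path: normalize just this character, then hash its
             * normalized bytes. */
            char    buf[NORM_BUF_MAX];
            size_t  n = normalize_next_char(&p, end, buf);
            size_t  i;

            for (i = 0; i < n; i++)
                h = (h ^ (unsigned char) buf[i]) * UINT64_C(1099511628211);
        }
    }
    return h;
}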