On 12/18/25 03:53, Shawn Rutledge wrote:
>> On Dec 17, 2025, at 22:17, Jacob Moody <[email protected]> wrote:
>>
>> I've been poking at some of the utf* functions lately and utfutf is a bit 
>> puzzling.
>> At face value, strstr() should be sufficient for handling utf8 encoded 
>> strings just as strcmp() is.
> 
> Maybe normalization could be the reason: there can be multiple 
> representations, for example, ü might be one code point (Unicode: U+00FC, 
> UTF-8: C3 BC), or might be u with a combining umlaut.  I would assume 
> converting to a rune would turn out the same either way: then you can compare 
> them even if the haystack is represented one way in utf8 and the needle is 
> the other way.  (Disclaimer: I’m not a unicode expert, even less so on 9)

No, normalization is completely orthogonal to this.
First of all, when these were written Plan 9 did not handle detached
codepoints or decomposed sequences at all, so I'd find it quite surprising
if the intention was to handle them here (or in chartorune).
Also, from a design standpoint your UTF decoding is not the correct place to
implement normalization, for a large number of reasons. To name a few:

1. Normalization requires the context of multiple codepoints, and it would be
quite complex for chartorune to do this, since by the standard's definition a
normalization context can technically be unbounded.
2. It would be quite surprising if your goal is to read in a file and write
it back out, and codepoints were silently converted along the way.
3. Normalization is not exactly cheap to perform, and chartorune is in the
hot path of a lot of code.
4. One form is not inherently more correct than the other; the Unicode
standard says you should treat composed and decomposed forms as equivalent.
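
To make the "converting to a rune" point concrete, here is a minimal sketch in
portable C. It does not use the Plan 9 library: decode_utf8 and decode_all are
hypothetical helpers written for this illustration, not chartorune, and the
decoder assumes well-formed input (it does not reject overlong encodings or
surrogates). It decodes the two spellings of ü from the example above and
shows that plain decoding yields one codepoint for the precomposed form and
two for the decomposed form, so they do not compare equal as runes either.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not Plan 9's chartorune): decode one UTF-8 sequence
 * starting at s into *cp and return the number of bytes consumed.
 * A byte that does not start a valid sequence is consumed alone as U+FFFD.
 */
static int decode_utf8(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
            && (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12)
            | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
            && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x07) << 18)
            | ((unsigned long)(s[1] & 0x3F) << 12)
            | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    *cp = 0xFFFD;
    return 1;
}

/*
 * Hypothetical helper: decode a NUL-terminated UTF-8 string into out,
 * writing at most max codepoints, and return how many were written.
 * No normalization is performed; codepoints come out exactly as encoded.
 */
static size_t decode_all(const char *s, unsigned long *out, size_t max)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p != '\0' && n < max)
        p += decode_utf8(p, &out[n++]);
    return n;
}

/* The two spellings of "ü" from the example above. */
static const char precomposed[] = "\xC3\xBC";  /* U+00FC */
static const char decomposed[]  = "u\xCC\x88"; /* U+0075 U+0308 */
```

Calling decode_all on precomposed gives the single codepoint U+00FC, while
decomposed gives U+0075 followed by U+0308. A byte-wise
strstr(decomposed, precomposed) finds nothing, and a rune-wise search would
not either, since the difference survives decoding. Making the two match needs
a separate normalization pass that looks across multiple codepoints.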

If you want more context specifically on normalization, I wrote a paper about
my normalization implementation for 9front that I presented at the last IWP9.


------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/T8831073f8b8bb351-M0acd2a42356729165fa7d00b