On Sun, Apr 27, 2025 at 07:53:26PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:34, Kent Overstreet <kent.overstr...@linux.dev> wrote:
> >
> > Do you mean to say that we invented yet another incompatible unicode
> > casefolding scheme?
> >
> > Dear god, why?
>
> Oh, Unicode itself comes with multiple "you can do this" schemes.
>
> It's designed by committee, and meant for different situations and
> different uses. Because the rules for things like sorting names are
> wildly different even for the same language, just for different
> contexts.
>
> Think of Unicode as "several decades of many different people coming
> together, all having very different use cases".
>
> So you find four different normalization forms, all with different use-cases.

I'm still dying to know why we had to invent our own, though. The
proliferation of standards is just ridiculous.

> And guess what? The only actual *valid* scheme for a filesystem is
> none of the four. Literally. It's to say "we don't normalize".
>
> Because the normalization forms are not meant to be some kind of "you
> should do this".
>
> They are meant as a kind of "if you are going to do X, then you can
> normalize into form Y, which makes doing X easier". And often the
> normalized form should only ever be an intermediate _temporary_ form
> for doing comparisons, not the actual form you save things in.
>
> Sadly, people so often get it wrong.
>
> For example, one very typical "you got it wrong, because you didn't
> understand the problem" case is to do comparisons by normalizing both
> sides (in one of the normalization forms) and then doing the
> comparison in that form.
>
> And guess what? 99.9% of the time, you just wasted enormous amounts of
> time, because you could have done the comparison first *without* any
> normalization at all, because equality is equality even when neither
> side is normalized.

Yeah, that's another point in favor of "index both the normalized and
un-normalized form".

i.e.: the normalized index is a special thing that doesn't have to
exist, and we only check it if the lookup in the un-normalized index
fails.

Case-insensitive capable filesystems could act just like normal
filesystems, unless specific pids opted into the extra "normalized
lookups" path.
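
To make the "index both" idea concrete, here is a rough userspace sketch
in Python. It is not how any in-kernel filesystem implements this. The
Directory class, fold_key(), and the casefold flag are all hypothetical
names, and NFD plus str.casefold() is only an assumed stand-in for
whatever folding scheme a real filesystem would use. Names are stored
exactly as given; the folded key is only ever a temporary lookup key.

```python
import unicodedata


def fold_key(name: str) -> str:
    # Assumption: NFD + casefold() approximates the real folding scheme.
    # The result is only used as a lookup key and is never stored as a name.
    return unicodedata.normalize("NFD", name).casefold()


class Directory:
    """Hypothetical directory with an exact index and an optional folded one."""

    def __init__(self):
        self.exact = {}   # name exactly as created -> inode number
        self.folded = {}  # folded key -> name exactly as created

    def create(self, name, inode):
        self.exact[name] = inode
        # First creator wins the folded slot; a real filesystem would
        # more likely reject a second name that folds to the same key.
        self.folded.setdefault(fold_key(name), name)

    def lookup(self, name, casefold=False):
        # Fast path: plain equality, no normalization at all.
        if name in self.exact:
            return self.exact[name]
        # Slow path: only for callers that opted in, and only on a miss.
        if not casefold:
            return None
        stored = self.folded.get(fold_key(name))
        return self.exact[stored] if stored is not None else None
```

With "Caf\u00e9" created as inode 1, an exact lookup hits the fast path
and never folds anything. A lookup for the decomposed, lowercase spelling
"cafe\u0301" misses unless the caller passed casefold=True, in which case
it resolves to the same inode through the folded index.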