On Sun, Apr 27, 2025 at 07:53:26PM -0700, Linus Torvalds wrote:
> On Sun, 27 Apr 2025 at 19:34, Kent Overstreet <kent.overstr...@linux.dev> wrote:
> >
> > Do you mean to say that we invented yet another incompatible unicode
> > casefolding scheme?
> >
> > Dear god, why?
> 
> Oh, Unicode itself comes with multiple "you can do this" schemes.
> 
> It's designed by committee, and meant for different situations and
> different uses.  Because the rules for things like sorting names are
> wildly different even for the same language, just for different
> contexts.
> 
> Think of Unicode as "several decades of many different people coming
> together, all having very different use cases".
> 
> So you find four different normalization forms, all with different use-cases.

I'm still dying to know why we had to invent our own, though. The
proliferation of standards is just ridiculous.

> And guess what? The only actual *valid* scheme for a filesystem is
> none of the four. Literally. It's to say "we don't normalize".
> 
> Because the normalization forms are not meant to be some kind of "you
> should do this".
> 
> They are meant as a kind of "if you are going to do X, then you can
> normalize into form Y, which makes doing X easier". And often the
> normalized form should only ever be an intermediate _temporary_ form
> for doing comparisons, not the actual form you save things in.
> 
> Sadly, people so often get it wrong.
> 
> For example, one very typical "you got it wrong, because you didn't
> understand the problem" case is to do comparisons by normalizing both
> sides (in one of the normalization forms) and then doing the
> comparison in that form.
> 
> And guess what? 99.9% of the time, you just wasted enormous amounts of
> time, because you could have done the comparison first *without* any
> normalization at all, because equality is equality even when neither
> side is normalized.
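
To make that concrete, here is a rough userspace sketch of "compare
raw first, normalize only on mismatch". The fold() helper and the
choice of NFD plus casefolding are assumptions for illustration only,
not the scheme any filesystem actually implements:

```python
import unicodedata


def fold(name: str) -> str:
    # Hypothetical comparison key: NFD, casefold, NFD again.
    # It is a temporary form used for comparing, never for storing.
    return unicodedata.normalize(
        "NFD", unicodedata.normalize("NFD", name).casefold()
    )


def names_match(a: str, b: str) -> bool:
    # Fast path: identical strings are equal under any normalization,
    # so no normalization work is needed at all.
    if a == b:
        return True
    # Slow path: pay for normalization only when the raw compare fails.
    return fold(a) == fold(b)
```

The common case (the caller passes the name exactly as stored) never
touches the Unicode tables.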

Yeah, that's another point in favor of "index both the normalized and
un-normalized form".

That is, the normalized index is a special thing that doesn't have to
exist, and we only check it if the lookup in the un-normalized index
fails.
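
A toy model of the two-index idea, in Python rather than kernel C so
it stays short. Everything here (the Directory class, fold(), the
casefold flag, rejecting names that collide after folding) is made up
for illustration and is not how any existing filesystem lays out its
dirents:

```python
import unicodedata


def fold(name):
    # Hypothetical normalized key: NFD, casefold, NFD again.
    return unicodedata.normalize(
        "NFD", unicodedata.normalize("NFD", name).casefold()
    )


class Directory:
    def __init__(self):
        self.exact = {}   # name exactly as stored -> inode number
        self.folded = {}  # normalized key -> name exactly as stored

    def create(self, name, inode):
        key = fold(name)
        # Assumption: names that collide after folding are rejected.
        if name in self.exact or key in self.folded:
            raise FileExistsError(name)
        self.exact[name] = inode
        self.folded[key] = name

    def lookup(self, name, casefold=False):
        # Always try the un-normalized index first.
        inode = self.exact.get(name)
        if inode is not None or not casefold:
            return inode
        # Only callers that opted in fall back to the normalized index.
        stored = self.folded.get(fold(name))
        return None if stored is None else self.exact[stored]
```

With d.create("Makefile", 1), d.lookup("makefile") misses like it
would on any normal filesystem, while d.lookup("makefile",
casefold=True) finds inode 1 through the second index.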

Case-insensitive capable filesystems could act just like normal
filesystems, unless specific pids opted into the extra "normalized
lookups" path.
