On Sun, Apr 27, 2025 at 11:01:20PM -0400, Kent Overstreet wrote:
> On Sun, Apr 27, 2025 at 07:39:46PM -0700, Linus Torvalds wrote:
> > On Sun, 27 Apr 2025 at 19:22, Eric Biggers <ebigg...@kernel.org> wrote:
> > >
> > > I suspect that all that was really needed was case-insensitivity of ASCII 
> > > a-z.
> > 
> > Yes. That's my argument. I think anything else ends up being a
> > mistake. MAYBE extend it to the first 256 characters in Unicode (aka
> > "Latin1").
> > 
> > Case folding on a-z is the only thing you could really effectively
> > rely on in user space even in the DOS times, because different
> > codepages would make for different rules for the upper 128 characters
> > anyway, and you could be in a situation where you literally couldn't
> > copy files from one floppy to another, because two files that had
> > distinct names on one floppy would have the *same* name on another
> > one.
> > 
> > Of course, that was mostly a weird corner case that almost nobody ever
> > actually saw in practice, because very few people even used anything
> > else than the default codepage.
> > 
> > And the same is afaik still true on NT, although practically speaking
> > I suspect it went from "unusual" to "really doesn't happen EVER in
> > practice".
> 
> I'm having trouble finding anything authoritative, but what I'm seeing
> indicates that NTFS does do Unicode casefolding (and their own
> incompatible version, at that).

NTFS "just" has a 65536-entry table that maps UTF-16 code units to their
"upper case" equivalents.  So it only does 1-to-1 codepoint mappings, and only
for U+FFFF and below.
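
To illustrate that model, here is a minimal sketch in C of comparing two UTF-16
names through a 65536-entry table.  The names upcase_table, upcase_table_init()
and utf16_names_equal() are hypothetical, and the table contents are a toy
stand-in (ASCII a-z plus the Latin-1 lowercase letters), not the real NTFS
table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the 65536-entry upcase table. */
static uint16_t upcase_table[65536];

/* Fill the table: identity by default, then a few toy mappings (assumption). */
static void upcase_table_init(void)
{
	uint32_t c;

	for (c = 0; c < 65536; c++)
		upcase_table[c] = (uint16_t)c;
	for (c = 'a'; c <= 'z'; c++)
		upcase_table[c] = (uint16_t)(c - 0x20);
	/* Latin-1 lowercase letters U+00E0..U+00FE, skipping U+00F7 (division sign) */
	for (c = 0xE0; c <= 0xFE; c++)
		if (c != 0xF7)
			upcase_table[c] = (uint16_t)(c - 0x20);
}

/*
 * Compare two names one UTF-16 code unit at a time.  Because the mapping is
 * strictly 1-to-1 on code units, names of different lengths can never match,
 * and each half of a surrogate pair is looked up on its own.
 */
static bool utf16_names_equal(const uint16_t *a, size_t alen,
			      const uint16_t *b, size_t blen)
{
	size_t i;

	if (alen != blen)
		return false;
	for (i = 0; i < alen; i++)
		if (upcase_table[a[i]] != upcase_table[b[i]])
			return false;
	return true;
}
```

In this toy table the surrogate code units map to themselves, so characters
above U+FFFF only ever match exactly.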

I suspect that it's the same, or at least nearly the same, as what
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt calls "simple"
casefolding (as opposed to "full" casefolding), but only for U+FFFF and below.

Of course, to implement the same with Linux's UTF-8 names, we won't be able to
just do a simple table lookup like Windows does.  But it could still be
implemented -- we'd just decode the Unicode codepoints from the string and apply
the same mapping from there.  Still much simpler than normalization.
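
As a rough sketch of what that could look like (not a proposed kernel
implementation): decode one codepoint at a time from each UTF-8 name, apply a
1-to-1 mapping, and compare.  fold_cp(), utf8_decode() and
utf8_names_equal_ci() are hypothetical names, fold_cp() is a toy stand-in for
the real mapping table, and falling back to an exact byte comparison on invalid
UTF-8 is an assumption made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy 1-to-1 mapping (assumption): ASCII a-z and Latin-1 lowercase only. */
static uint32_t fold_cp(uint32_t cp)
{
	if (cp > 0xFFFF)
		return cp;	/* nothing above U+FFFF is mapped */
	if (cp >= 'a' && cp <= 'z')
		return cp - 0x20;
	if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7)
		return cp - 0x20;
	return cp;
}

/* Decode one codepoint; returns bytes consumed, or 0 if the input is invalid. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
	uint32_t c, min;
	size_t n, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2; c = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; c = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; c = s[0] & 0x07; min = 0x10000;
	} else {
		return 0;
	}
	if (len < n)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and values past U+10FFFF. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return 0;
	*cp = c;
	return n;
}

/* Case-insensitive comparison of two UTF-8 names, codepoint by codepoint. */
static bool utf8_names_equal_ci(const char *a, size_t alen,
				const char *b, size_t blen)
{
	const unsigned char *pa = (const unsigned char *)a;
	const unsigned char *pb = (const unsigned char *)b;

	while (alen && blen) {
		uint32_t ca, cb;
		size_t na = utf8_decode(pa, alen, &ca);
		size_t nb = utf8_decode(pb, blen, &cb);

		/* Invalid UTF-8: require the remaining bytes to match exactly. */
		if (!na || !nb)
			return alen == blen && memcmp(pa, pb, alen) == 0;
		if (fold_cp(ca) != fold_cp(cb))
			return false;
		pa += na; alen -= na;
		pb += nb; blen -= nb;
	}
	return alen == 0 && blen == 0;
}
```

There is deliberately no up-front length check: with a real table, a codepoint
and its mapped form need not have the same UTF-8 byte length, so the comparison
has to walk both names codepoint by codepoint.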

- Eric
