On Sun, Apr 27, 2025 at 11:01:20PM -0400, Kent Overstreet wrote:
> On Sun, Apr 27, 2025 at 07:39:46PM -0700, Linus Torvalds wrote:
> > On Sun, 27 Apr 2025 at 19:22, Eric Biggers <ebigg...@kernel.org> wrote:
> > >
> > > I suspect that all that was really needed was case-insensitivity of ASCII
> > > a-z.
> >
> > Yes. That's my argument. I think anything else ends up being a
> > mistake. MAYBE extend it to the first 256 characters in Unicode (aka
> > "Latin1").
> >
> > Case folding on a-z is the only thing you could really effectively
> > rely on in user space even in the DOS times, because different
> > codepages would make for different rules for the upper 128 characters
> > anyway, and you could be in a situation where you literally couldn't
> > copy files from one floppy to another, because two files that had
> > distinct names on one floppy would have the *same* name on another
> > one.
> >
> > Of course, that was mostly a weird corner case that almost nobody ever
> > actually saw in practice, because very few people even used anything
> > else than the default codepage.
> >
> > And the same is afaik still true on NT, although practically speaking
> > I suspect it went from "unusual" to "really doesn't happen EVER in
> > practice".
>
> I'm having trouble finding anything authoritative, but what I'm seeing
> indicates that NTFS does do Unicode casefolding (and their own
> incompatible version, at that).

NTFS "just" has a 65536-entry table that maps UTF-16 code units to their
"upper case" equivalents.  So it only does 1-to-1 codepoint mappings, and only
for U+FFFF and below.

I suspect that it's the same, or at least nearly the same, as what
https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt calls "simple"
casefolding (as opposed to "full" casefolding), but only for U+FFFF and below.

Of course, to implement the same with Linux's UTF-8 names, we won't be able to
just do a simple table lookup like Windows does.  But it could still be
implemented -- we'd just decode the Unicode codepoints from the string and apply
the same mapping from there.  Still much simpler than normalization.

- Eric
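
To make the "decode the codepoints and apply the same mapping" idea concrete,
here is a rough user-space sketch in Python. It is not kernel code and not the
real NTFS table: build_upcase_table(), fold_name() and names_match() are
hypothetical names, and the table is only approximated from Python's own
str.upper(), keeping an entry only when the result is a single codepoint at or
below U+FFFF. The real table may well differ in places.

```python
# Sketch only: a 65536-entry, 1-to-1 "upper case" table applied to UTF-8 names.
# The table contents are an assumption derived from Python's str.upper(),
# not a copy of what NTFS actually ships.

def build_upcase_table():
    """Return a list mapping each codepoint <= U+FFFF to its 1-to-1 upper case."""
    table = list(range(0x10000))  # identity mapping by default
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters; leave them alone
        up = chr(cp).upper()
        # Keep only 1-to-1 mappings that stay at or below U+FFFF.
        # Multi-character results (e.g. "ß" -> "SS") are dropped, so the
        # character maps to itself.
        if len(up) == 1 and ord(up) <= 0xFFFF:
            table[cp] = ord(up)
    return table


UPCASE = build_upcase_table()


def fold_name(name: bytes) -> bytes:
    """Decode a UTF-8 name, map each codepoint through the table, re-encode."""
    text = name.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    folded = "".join(
        chr(UPCASE[ord(ch)]) if ord(ch) <= 0xFFFF else ch  # above U+FFFF: untouched
        for ch in text
    )
    return folded.encode("utf-8")


def names_match(a: bytes, b: bytes) -> bool:
    """Case-insensitive comparison of two UTF-8 names under the table above."""
    return fold_name(a) == fold_name(b)
```

With this sketch, "readme.TXT" and "README.txt" compare equal, while "Straße"
and "STRASSE" do not, because a 1-to-1 table cannot expand one codepoint into
two. Names that differ only in the case of codepoints above U+FFFF also stay
distinct, since those codepoints are passed through unchanged.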