On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
> On Fri, Apr 25, 2025 at 08:40:35PM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 25, 2025 at 09:35:27AM -0700, Linus Torvalds wrote:
> > > Now, if filesystem people were to see the light, and have a proper and
> > > well-designed case insensitivity, that might change. But I've never
> > > seen even a *whiff* of that. I have only seen bad code that
> > > understands neither how UTF-8 works, nor how unicode works (or rather:
> > > how unicode does *not* work - code that uses the unicode comparison
> > > functions without a deeper understanding of what the implications
> > > are).
> > >
> > > Your comments blaming unicode is only another sign of that.
> > >
> > > Because no, the problem with bad case folding isn't in unicode.
> > >
> > > It's in filesystem people who didn't understand - and still don't,
> > > after decades - that you MUST NOT just blindly follow some external
> > > case folding table that you don't understand and that can change over
> > > time.
> >
> > I think this is something that NTFS actually got right. Each filesystem
> > carries with it a 128KiB table that maps each codepoint to its
> > case-insensitive equivalent. So there's no ambiguity about "which
> > version of the unicode standard are we using", "Does the user care
> > about Turkish language rules?", "Is Aachen a German or Danish word?".
> > The sysadmin specified all that when they created the filesystem, and it
> > doesn't matter what the Unicode standard changes in the future; if you
> > need to change how the filesystem sorts things, you can update the table.
> >
> > It's not the perfect solution, but it might be the least-bad one I've
> > seen.
>
> The thing is, that's exactly what we're doing. ext4 and bcachefs both
> refer to a specific revision of the folding rules: for ext4 it's
> specified in the superblock, for bcachefs it's hardcoded for the moment.
>
> I don't think this is the ideal approach, though.
>
> That means the folding rules are "whatever you got when you mkfs'd".
> Think about what that means if you've got a fleet of machines, of
> different ages, but all updated in sync: that's a really annoying way
> for gremlins of the "why does this machine act differently" variety to
> creep in.
>
> What I'd prefer is for the unicode folding rules to be transparently and
> automatically updated when the kernel is updated, so that behaviour
> stays in sync. That would behave more the way users would expect.
>
> But I only gave this real thought just over the past few days, and doing
> this safely and correctly would require some fairly significant changes
> to the way casefolding works.
>
> We'd have to ensure that lookups via the case sensitive name always
> works, even if the casefolding table the dirent was created with give
> different results that the currently active casefolding table.
>
> That would require storing two different "dirents" for each real dirent,
> one normalized and one un-normalized, because we'd have to do an
> un-normalized lookup if the normalized lookup fails (and vice versa).
> Which should be completely fine from a performance POV, assuming we have
> working negative dentries.
>
> But, if the unicode folding rules are stable enough (and one would hope
> they are), hopefully all this is a non-issue.
>
> I'd have to gather more input from users of casefolding on other
> filesystems before saying what our long term plans (if any) will be.

Wouldn't lookups via the case-sensitive name keep working even if the
case-insensitivity rules change? It's lookups via a case-insensitive
name that could start producing different results. Applications can
depend on case-insensitive lookups being done in a certain way, so
changing the case-insensitivity rules can be risky.

Regardless, the long-term plan for the case-insensitivity rules should
be to deprecate the current set of rules, which does Unicode
normalization and is way overkill. It should be replaced with a simple
version of case-insensitivity that matches what FAT does. And
*possibly* also a version that matches what NTFS does (a
u16 upcase_table[65536] indexed by UTF-16 code units), if someone
really needs that.

As far as I know, that was all that was really needed in the first
place. People misunderstood the problem as being about language
support, rather than about compatibility with legacy filesystems. And
as a result they incorrectly decided they should do Unicode
normalization, which is way too complex and has all sorts of weird
properties.

- Eric
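
To make the NTFS-style option concrete, here is a rough userspace sketch of a comparison driven by a u16 upcase_table[65536]. It is not taken from any in-tree code. The helper names (upcase_table_init_ascii, names_equal_ci) are made up for illustration, and the table contents are a stand-in that only folds ASCII a-z; a real volume would carry its own table. Note that 65536 entries of 2 bytes each is the 128KiB mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per UTF-16 code unit: 65536 * 2 bytes = 128 KiB. */
static uint16_t upcase_table[65536];

/*
 * Hypothetical initializer (an assumption for this sketch): identity
 * mapping everywhere, plus ASCII a-z -> A-Z.  A real filesystem would
 * load the table that was written at mkfs time instead.
 */
void upcase_table_init_ascii(void)
{
	for (uint32_t i = 0; i < 65536; i++)
		upcase_table[i] = (uint16_t)i;
	for (uint16_t c = 'a'; c <= 'z'; c++)
		upcase_table[c] = (uint16_t)(c - 'a' + 'A');
}

/*
 * Compare two UTF-16 names one code unit at a time through the table.
 * The mapping is strictly 1:1 per code unit, so names of different
 * lengths can never match, and no normalization is involved.
 */
bool names_equal_ci(const uint16_t *a, size_t a_len,
		    const uint16_t *b, size_t b_len)
{
	if (a_len != b_len)
		return false;
	for (size_t i = 0; i < a_len; i++)
		if (upcase_table[a[i]] != upcase_table[b[i]])
			return false;
	return true;
}
```

With the stand-in table, "Readme" and "README" compare equal, while U+00E9 and U+00C9 do not, because nothing outside ASCII is folded here. The whole rule set is just the table contents, which is what makes the behaviour easy to pin down.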