Re: [GIT PULL] bcachefs fixes for 6.15-rc4

Eric Biggers Sun, 27 Apr 2025 18:31:16 -0700

On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
> On Fri, Apr 25, 2025 at 08:40:35PM +0100, Matthew Wilcox wrote:
> > On Fri, Apr 25, 2025 at 09:35:27AM -0700, Linus Torvalds wrote:
> > > Now, if filesystem people were to see the light, and have a proper and
> > > well-designed case insensitivity, that might change. But I've never
> > > seen even a *whiff* of that. I have only seen bad code that
> > > understands neither how UTF-8 works, nor how unicode works (or rather:
> > > how unicode does *not* work - code that uses the unicode comparison
> > > functions without a deeper understanding of what the implications
> > > are).
> > > 
> > > Your comments blaming unicode is only another sign of that.
> > > 
> > > Because no, the problem with bad case folding isn't in unicode.
> > > 
> > > It's in filesystem people who didn't understand - and still don't,
> > > after decades - that you MUST NOT just blindly follow some external
> > > case folding table that you don't understand and that can change over
> > > time.
> > 
> > I think this is something that NTFS actually got right.  Each filesystem
> > carries with it a 128KiB table that maps each codepoint to its
> > case-insensitive equivalent.  So there's no ambiguity about "which
> > version of the unicode standard are we using", "Does the user care
> > about Turkish language rules?", "Is Aachen a German or Danish word?".
> > The sysadmin specified all that when they created the filesystem, and it
> > doesn't matter what the Unicode standard changes in the future; if you
> > need to change how the filesystem sorts things, you can update the table.
> > 
> > It's not the perfect solution, but it might be the least-bad one I've
> > seen.
> 
> The thing is, that's exactly what we're doing. ext4 and bcachefs both
> refer to a specific revision of the folding rules: for ext4 it's
> specified in the superblock, for bcachefs it's hardcoded for the moment.
> 
> I don't think this is the ideal approach, though.
> 
> That means the folding rules are "whatever you got when you mkfs'd".
> Think about what that means if you've got a fleet of machines, of
> different ages, but all updated in sync: that's a really annoying way
> for gremlins of the "why does this machine act differently" variety to
> creep in.
> 
> What I'd prefer is for the unicode folding rules to be transparently and
> automatically updated when the kernel is updated, so that behaviour
> stays in sync. That would behave more the way users would expect.
> 
> But I only gave this real thought just over the past few days, and doing
> this safely and correctly would require some fairly significant changes
> to the way casefolding works.
> 
> We'd have to ensure that lookups via the case sensitive name always
> work, even if the casefolding table the dirent was created with gives
> different results than the currently active casefolding table.
> 
> That would require storing two different "dirents" for each real dirent,
> one normalized and one un-normalized, because we'd have to do an
> un-normalized lookup if the normalized lookup fails (and vice versa).
> Which should be completely fine from a performance POV, assuming we have
> working negative dentries.
> 
> But, if the unicode folding rules are stable enough (and one would hope
> they are), hopefully all this is a non-issue.
> 
> I'd have to gather more input from users of casefolding on other
> filesystems before saying what our long term plans (if any) will be.


Wouldn't lookups via the case-sensitive name keep working even if the
case-insensitivity rules change?  It's lookups via a case-insensitive name that
could start producing different results.  Applications can depend on
case-insensitive lookups being done in a certain way, so changing the
case-insensitivity rules can be risky.
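
To make that distinction concrete, here is a minimal sketch.  It is not code
from any filesystem: fold_v1() and fold_v2() are hypothetical stand-ins for
"the rules at mkfs time" and "the rules after an update", and names are
treated as arrays of 16-bit code units purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule set "v1": folds ASCII letters only. */
static uint16_t fold_v1(uint16_t c)
{
	return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + ('a' - 'A')) : c;
}

/* Hypothetical rule set "v2": v1 plus U+00C0..U+00DE (except U+00D7). */
static uint16_t fold_v2(uint16_t c)
{
	if (c >= 0x00C0 && c <= 0x00DE && c != 0x00D7)
		return (uint16_t)(c + 0x20);
	return fold_v1(c);
}

/* Case-sensitive comparison: no folding rule is consulted at all. */
static bool name_eq_exact(const uint16_t *a, const uint16_t *b, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (a[i] != b[i])
			return false;
	return true;
}

/* Case-insensitive comparison: the result depends on which rule is active. */
static bool name_eq_folded(const uint16_t *a, const uint16_t *b, size_t n,
			   uint16_t (*fold)(uint16_t))
{
	for (size_t i = 0; i < n; i++)
		if (fold(a[i]) != fold(b[i]))
			return false;
	return true;
}

void fold_rules_demo(void)
{
	const uint16_t stored[] = { 0x00C9, 't', 'e' };	/* U+00C9 is capital E-acute */
	const uint16_t query[]  = { 0x00E9, 't', 'e' };	/* U+00E9 is small e-acute */

	/* The exact name matches itself regardless of the folding rules. */
	assert(name_eq_exact(stored, stored, 3));
	/* The case-insensitive result flips when the rules change. */
	assert(!name_eq_folded(stored, query, 3, fold_v1));
	assert(name_eq_folded(stored, query, 3, fold_v2));
}
```

In this sketch only name_eq_folded() takes a rule as input, so only its
answers can change when the rules do.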

Regardless, the long-term plan for the case-insensitivity rules should be to
deprecate the current set of rules, which perform Unicode normalization; that is
way overkill.  They should be replaced with a simple version of case-insensitivity
that matches what FAT does.  And *possibly* also a version that matches what
NTFS does (a u16 upcase_table[65536] indexed by UTF-16 code units), if someone
really needs that.
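
For illustration, a table-driven comparison of that shape could look roughly
like the sketch below.  This is an assumption-laden sketch, not NTFS or kernel
code: the function names are made up, and the table is filled with ASCII-only
mappings as a placeholder, whereas a real implementation would load the table
stored with the filesystem.  Note that 65536 entries of 2 bytes each is the
128KiB mentioned earlier in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per UTF-16 code unit: 65536 * sizeof(uint16_t) = 128 KiB. */
static uint16_t upcase_table[65536];

/*
 * Placeholder initialization (hypothetical): identity everywhere, plus
 * ASCII a-z mapped to A-Z.  A real table would come from the filesystem.
 */
static void upcase_table_init_ascii(void)
{
	for (uint32_t i = 0; i < 65536; i++)
		upcase_table[i] = (uint16_t)i;
	for (uint16_t c = 'a'; c <= 'z'; c++)
		upcase_table[c] = (uint16_t)(c - ('a' - 'A'));
}

/*
 * Compare two names code unit by code unit through the table.  No
 * normalization, no decoding of surrogate pairs: each unit maps to
 * exactly one unit, so names of different lengths can never match.
 */
static bool names_equal_ci(const uint16_t *a, size_t alen,
			   const uint16_t *b, size_t blen)
{
	if (alen != blen)
		return false;
	for (size_t i = 0; i < alen; i++)
		if (upcase_table[a[i]] != upcase_table[b[i]])
			return false;
	return true;
}
```

The appeal of this design is that the whole rule set is one fixed-size array
lookup per code unit, with nothing that depends on a Unicode version.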

As far as I know, that was all that was really needed in the first place.

People misunderstood the problem as being about language support, rather than
about compatibility with legacy filesystems.  And as a result they incorrectly
decided they should do Unicode normalization, which is way too complex and has
all sorts of weird properties.

- Eric
