On 4/28/25 2:43 AM, Kent Overstreet wrote:
On Sun, Apr 27, 2025 at 06:30:59PM -0700, Eric Biggers wrote:
On Sun, Apr 27, 2025 at 08:55:30PM -0400, Kent Overstreet wrote:
The thing is, that's exactly what we're doing. ext4 and bcachefs both
refer to a specific revision of the folding rules: for ext4 it's
specified in the superblock, for bcachefs it's hardcoded for the moment.

I don't think this is the ideal approach, though.

That means the folding rules are "whatever you got when you mkfs'd".
Think about what that means if you've got a fleet of machines, of
different ages, but all updated in sync: that's a really annoying way
for gremlins of the "why does this machine act differently" variety to
creep in.

What I'd prefer is for the unicode folding rules to be transparently and
automatically updated when the kernel is updated, so that behaviour
stays in sync. That would behave more the way users would expect.

But I only gave this real thought just over the past few days, and doing
this safely and correctly would require some fairly significant changes
to the way casefolding works.

We'd have to ensure that lookups via the case-sensitive name always
work, even if the casefolding table the dirent was created with gives
different results than the currently active casefolding table.

That would require storing two different "dirents" for each real dirent,
one normalized and one un-normalized, because we'd have to do an
un-normalized lookup if the normalized lookup fails (and vice versa).
Which should be completely fine from a performance POV, assuming we have
working negative dentries.
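
A minimal sketch of the fallback lookup described above, as a toy in-memory
model and not bcachefs code. The names Directory, fold_old and fold_new are
hypothetical, and str.lower() / str.casefold() only stand in for two
different revisions of the folding table.

```python
# Toy model only: two dicts play the role of the normalized and
# un-normalized "dirents". All names here are hypothetical.

def fold_old(name):
    # Stand-in for the folding table the dirent was created with.
    return name.lower()

def fold_new(name):
    # Stand-in for a later folding table that gives different results.
    return name.casefold()

class Directory:
    def __init__(self, fold):
        self.fold = fold      # currently active folding function
        self.by_folded = {}   # normalized name   -> inode
        self.by_exact = {}    # un-normalized name -> inode

    def create(self, name, inode):
        # Store both entries for each real dirent.
        self.by_folded[self.fold(name)] = inode
        self.by_exact[name] = inode

    def lookup(self, name):
        # Normalized lookup first; fall back to the exact name on a miss.
        hit = self.by_folded.get(self.fold(name))
        if hit is not None:
            return hit
        return self.by_exact.get(name)

d = Directory(fold_old)
d.create("Straße", 42)
d.fold = fold_new            # simulate a kernel update changing the table
print(d.lookup("Straße"))    # exact name still resolves via the fallback
print(d.lookup("STRAßE"))    # case-insensitive result may change
```

In this model the exact name keeps resolving after the table changes, while
a lookup that relied on the old folding can start to miss.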

But, if the unicode folding rules are stable enough (and one would hope
they are), hopefully all this is a non-issue.

I'd have to gather more input from users of casefolding on other
filesystems before saying what our long term plans (if any) will be.

Wouldn't lookups via the case-sensitive name keep working even if the
case-insensitivity rules change?  It's lookups via a case-insensitive name that
could start producing different results.  Applications can depend on
case-insensitive lookups being done in a certain way, so changing the
case-insensitivity rules can be risky.

No, because right now on a case-insensitive filesystem we _only_ do the
lookup with the normalized name.

Regardless, the long-term plan for the case-insensitivity rules should be to
deprecate the current set of rules, which does Unicode normalization which is
way overkill.  It should be replaced with a simple version of case-insensitivity
that matches what FAT does.  And *possibly* also a version that matches what
NTFS does (a u16 upcase_table[65536] indexed by UTF-16 code units), if someone
really needs that.
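
A rough sketch of what comparing names through a 65536-entry upcase table
indexed by UTF-16 code units could look like. The table here is built from
Python's own Unicode data purely for illustration; it is an assumption and
not the table NTFS actually stores on a volume. The helper names
(build_upcase_table, utf16_units, names_equal) are hypothetical.

```python
import struct

def build_upcase_table():
    # One entry per UTF-16 code unit; the default is the identity mapping.
    table = list(range(0x10000))
    for cu in range(0x10000):
        if 0xD800 <= cu <= 0xDFFF:
            continue  # surrogate code units map to themselves
        up = chr(cu).upper()
        # Keep only mappings that stay a single BMP code unit.
        if len(up) == 1 and ord(up) < 0x10000:
            table[cu] = ord(up)
    return table

def utf16_units(name):
    # Split a string into its UTF-16 code units.
    data = name.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

def names_equal(a, b, table):
    # Unit-by-unit comparison: no normalization, lengths never change.
    ua, ub = utf16_units(a), utf16_units(b)
    return len(ua) == len(ub) and all(
        table[x] == table[y] for x, y in zip(ua, ub))

table = build_upcase_table()
print(names_equal("README.txt", "readme.TXT", table))
print(names_equal("straße", "STRASSE", table))
```

Because each code unit maps to exactly one code unit, a name such as
"straße" can never match "STRASSE" under this scheme.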

As far as I know, that was all that was really needed in the first place.

People misunderstood the problem as being about language support, rather than
about compatibility with legacy filesystems.  And as a result they incorrectly
decided they should do Unicode normalization, which is way too complex and has
all sorts of weird properties.

Believe me, I do see the appeal of that.

One of the things I should really float with e.g. Valve is the
possibility of providing tooling/auditing to make it easy to fix
userspace code that's doing lookups that only work with casefolding.

This is not really about fixing userspace code that expects casefolding, or providing some form of stopgap there.

The main need there is Proton/Wine, which is a compat layer for Windows apps, which needs to pretend it's on NTFS and everything there expects casefolding to work.

No auditing/tooling required, we know the problem. It is unavoidable.

I agree with the point about Unicode normalization being odd, though. When I was implementing casefolding for bcachefs, I immediately thought it was a huge hammer to do full normalization for the intended purpose, rather than just using a big table...

FWIR, there are actually two forms of casefolding in Unicode: full casefolding (C+F), where lengths can change (e.g. ß->ss), and simple casefolding (C+S), where lengths don't change and the mapping is character for character.
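
A small illustration of the difference. Python's str.casefold() performs
full case folding. The standard library has no simple-folding function, so
simple_fold below is a hypothetical approximation that keeps a character
unchanged whenever no one-to-one mapping is available. It is not derived
from the C+S entries of CaseFolding.txt.

```python
def full_fold(name):
    # Full case folding: one character may expand to several.
    return name.casefold()

def simple_fold(name):
    # Approximation of simple folding: each character maps to exactly one
    # character, so the length in code points is preserved.
    out = []
    for ch in name:
        folded = ch.casefold()
        if len(folded) != 1:
            lowered = ch.lower()
            folded = lowered if len(lowered) == 1 else ch
        out.append(folded)
    return "".join(out)

name = "Straße"
print(full_fold(name), len(full_fold(name)))
print(simple_fold(name), len(simple_fold(name)))
```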

- Autumn ✨


And, another thing I'd like is a way to make casefolding per-process, so
that it could be opt-in for the programs that need it - so that new code
isn't accidentally depending on casefolding.

That's something we really should have, anyways.

But, as much as we might hate it, casefolding is something that users
like and do expect in other contexts, so if casefolding is going to
exist (as more than just a compatibility thing for legacy code) - it
really ought to be unicode, and utf8 really has won at this point.

Mainly though, it's not a decision I care to revisit, I intend to stick
with casefolding that's compatible with how it's done on our other
filesystems where it's widely used.



