On Tue, Jul 01, 2025 at 10:43:11AM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstr...@linux.dev> writes:
> 
> I wasn't sure if I wanted to chime in here, or even if it would be
> worth it.  But whatever.
> 
> > On Thu, Jun 26, 2025 at 08:21:23PM -0700, Linus Torvalds wrote:
> >> On Thu, 26 Jun 2025 at 19:23, Kent Overstreet <kent.overstr...@linux.dev> 
> >> wrote:
> >> >
> >> > per the maintainer thread discussion and precedent in xfs and btrfs
> >> > for repair code in RCs, journal_rewind is again included
> >> 
> >> I have pulled this, but also as per that discussion, I think we'll be
> >> parting ways in the 6.17 merge window.
> >> 
> >> You made it very clear that I can't even question any bug-fixes and I
> >> should just pull anything and everything.
> 
> > Linus, I'm not trying to say you can't have any say in bcachefs. Not at
> > all.
> 
> > I positively enjoy working with you - when you're not being a dick,
> > but you can be genuinely impossible sometimes. A lot of times...
> 
> Kent, you can be a dick too.  Prime example, the lines above.  And
> how you've treated me and others who gave feedback on bcachefs in the
> past.  I'm not a programmer, I'm in IT and follow this because it's
> interesting, and I've been doing data management all my career.  So
> new filesystems are interesting.  

Oh yes, I can be. I apologize if I've been a dick to you personally, I
try to be nice to my users and build good working relationships. But
kernel development is a high stakes, high pressure, stressful job, as I
often remind people. I don't ever take it personally, although sometimes
we do need to cool off before we drive each other completely mad :)

If there was something that was unresolved, and you'd like me to look at
it again, I'd be more than happy to. If you want to share what you were
hitting here, I'll tell you what I know - and if it was from a year or
more ago it's most likely been fixed.

> Slow down.  

This is the most critical phase in the 10+ year process of shipping a
new filesystem.

We're seeing continually increasing usage (hopefully by users who are
prepared to accept that risk, but not always!), but we're not yet ready
for true widespread deployment.

Shipping a project as large and complex as a filesystem must be done
incrementally, in stages where we're deploying to gradually increasing
numbers of users, fixing everything they find and assessing where we're
at before opening it up to more users.

Working with users, supporting them, checking in on how it's doing,
and getting them the fixes for what they find is how we iterate and
improve. The job is not done until it's working well for everyone.

Right now, everyone is concerned because this is a hotly anticipated
project, and everyone wants to see it done right.

And in 6.16, we had two massive pull requests (30+ patches in a week,
twice in a row); that also generates concern when people are wondering
"is this thing stabilizing?".

6.16 was largely a case of a few particularly interesting bug reports
generating a bunch of fixes (and relatively simple and localized fixes,
which is what we like to see) for repair corner cases, the biggest
culprit (again) being snapshots.

If you look at the bug tracker, especially rate of incoming bugs and the
severity of bug reports (and also other sources of bug reports, like
reddit and IRC) - yes, we are stabilizing fast.

There is still a lot of work to be done, but we're on the right track.

"Slowing down" is not something you do without a concrete reason. Right
now we need to be getting those fixes out to users so they can keep
testing and finding the next bug. When someone has invested time and
effort learning how the system works and how to report bugs, we don't
want them getting frustrated and leaving - we want to work with them, so
they can keep testing and finding new bugs.

The signals that would tell me it's time to slow down are:

- Regressions getting through (quantity, severity, time spent on fixing
  them)
- Bugs getting through that show that something fundamental is
  missing (testing, hardening), or broken in our design.
- Frequency of bug reports going up to where I can't keep up (it's been
  in steady, gradual decline)

We actually do not want this to be 100% perfect before it sees users.
That would result in a filesystem that's brittle - a glass cannon. We
might get it to the point where it works 99% of the time, but then when
it breaks we'd be in a panic - and if you discover it then, when it's in
the wild, it's too late.

The processes for how we debug and recover from failures, in the wild,
are a huge part (perhaps the majority) of what we're working on now. That
stuff has to be baked into the design on a deep level, and like all
other complex design it requires continual iteration.

That is how we'll get the reliability and robustness we hope to achieve.
