On Mon, Jun 29, 2020 at 10:26:37AM -0600, Chris Murphy wrote:
> You've got an example where 'btrfs restore' saw no files at all? And
> you think it's the file system rather than the hardware, why?

Because the system failed to boot up, and even after offline repair 
attempts was still missing a sufficiently large chunk of the root 
filesystem to necessitate re-installation.

Because the same hardware provided literally years of problem-free 
stability with ext4 (before) and xfs (after).

> I think this is the wrong metaphor because it suggests btrfs caused
> the crapping. The sequence is: btrfs does the right thing, drive
> firmware craps itself and there's a power failure or a crash. Btrfs in
> the ordinary case doesn't care and boots without complaint. In the far

The first time, I needed to physically move the system, so the machine 
was shut down via 'shutdown -h now' on a console, and didn't come back 
up.

The second time was a routine post-dnf-update 'reboot', without power 
cycling anything.

At no point was there ever any unclean shutdown, and at the time of 
those reboots, no errors were reported in the kernel logs.

Once is a fluke, twice is a trend... and I didn't have the patience for 
a third try because I needed to be able to rely on the system to not eat 
itself.

I can't get the complete details at the moment, but it was an AMD E-350 
system with a 32GB ADATA SATA SSD, configured using anaconda's btrfs 
defaults and with only about 30% of disk space used.  Pretty minimal I/O.

I will concede that it's possible there was/is some sort of 
hardware/firmware bug, but if so, only btrfs seemed to trigger it.

(more on this later)

> Come on. It's cleanly unmounted and doesn't mount?

Yes.  (See above)

(Granted, I'm using "mount" to mean "successfully mounted a writable 
 filesystem with data largely intact" -- I'm a bit fuzzy on the exact 
 details, but I believe it did mount read-only before the boot 
 crapped out due to missing/inaccessible system libraries.  I had to 
 resort to a USB stick to attempt repairs that were only partially 
 successful)
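
For what it's worth, the rescue attempt from the USB stick went roughly 
along these lines -- I'm reconstructing from memory, so the device name 
and target path are purely illustrative:

    # mount read-only first to see what survived
    mount -o ro /dev/sda2 /mnt
    # if that fails, fall back to an older tree root
    # (-o recovery on older kernels)
    mount -o ro,usebackuproot /dev/sda2 /mnt
    # check metadata without writing anything
    btrfs check /dev/sda2
    # last resort: copy whatever is still readable onto another disk
    btrfs restore /dev/sda2 /run/media/rescue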

> All file systems have write ordering expectations. If the hardware
> doesn't honor that, it's trouble if there's a crash. What you're
> describing is 100% a hardware crapped itself case. You said it cleanly
> unmounted i.e. the exact correct write ordering did happen. And yet
> the file system can't be mounted again. That's a hardware failure.

That may be the case, but when there were no crashes, and neither ext4 
nor xfs crapped themselves under day-to-day operation with the same 
hardware, it's reasonable to infer that the problem has _something_ to 
do with the variable that changed, i.e. btrfs.

> There is no way for one person to determine if Btrfs is ready. That's
> done by combination of synthetic tests (xfstests) and volume
> regression testing on actual workloads. And by the way the Red Hat CKI
> project is going to help run btrfs xfstests for Fedora kernels.

Of course not, but the Fedora community is made up of innumerable "one 
persons", each responsible for several special snowflake systems.

Let's say for sake of argument that my bad btrfs experiences were due to 
bugs in device firmware with btrfs's completely-legal usage patterns 
rather than bugs in btrfs-from-five-years-ago.  That's great... except 
my system still got trashed to the point of needing to be reinstalled, 
and finger-pointing can't bring back lost data.

How many more special snowflake drives are out there?  Think about how 
long it took Fedora to enable TRIM out of concern for potential data 
loss.  Why should this be any different?

(We're always going to be stuck with buggy firmware.  FFS, the Samsung 
 860 EVO SATA SSD that I have in my main workstation will hiccup to the 
 point of trashing data when used with AMD SATA controllers... even 
 under Windows!  Their official support answer is "Use an Intel 
 controller".  And that's a tier-one manufacturer who presumably has 
 among the best QA and support in the industry...)

If there are devices or firmware known to be problematic, we need to keep 
track of those buggy devices and either automatically provide 
workarounds or provide some way to tell the user that proceeding with 
btrfs may be perilous to their data.
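
Even something as crude as a plain-text denylist consulted at install 
time would help; a rough sketch, where the denylist file, its location, 
and the warning wording are all hypothetical:

    #!/bin/sh
    # Hypothetical sketch: warn if the target drive's model string appears
    # on a known-problematic list before defaulting to btrfs.
    DEV="$1"                            # e.g. /dev/sda
    MODEL=$(lsblk -dno MODEL "$DEV")    # the drive's reported model string
    [ -n "$MODEL" ] || exit 0           # nothing to check against
    if grep -qiF "$MODEL" /usr/share/btrfs-denylist.txt; then
        echo "WARNING: $MODEL has known firmware bugs; btrfs may be risky." >&2
    fi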

(Or perhaps the issues I had were due to bugs in btrfs-of-five-years-ago 
 that have long since been fixed.  Either way, given my twice-burned 
 experiences, I would want to verify that for myself before I entrust it 
 with any data I care about...)

> The questions are whether the Fedora community wants and is ready for
> Btrfs by default.

There are obviously some folks here (myself included) that have had very 
negative btrfs experiences.  Similarly, there are folks that have 
successfully overseen large-scale deployments of btrfs in their managed 
environments (not on Fedora, though, IIUC).

So yes, I think an explicit "let's all test btrfs (as anaconda 
configures it) before we make it default" period is warranted.  

Perhaps one can argue that Fedora has already been doing that for the 
past two years (since 2018-or-later-btrfs is what everyone with positive 
results appears to be talking about), but it's still not clear whether 
those deployments use the same feature set as Fedora's defaults, or 
how broad the hardware sample is.

 - Solomon
-- 
Solomon Peachy                        pizza at shaftnet dot org (email&xmpp)
                                      @pizza:shaftnet dot org   (matrix)
High Springs, FL                      speachy (freenode)
