On 2016-09-12 05:48, Martin Steigerwald wrote:
On Sunday, 26 June 2016, 13:13:04 CEST, Steven Haigh wrote:
On 26/06/16 12:30, Duncan wrote:
> Steven Haigh posted on Sun, 26 Jun 2016 02:39:23 +1000 as excerpted:
>> In every case, it was a flurry of csum error messages, then instant
>> death.
>
> This is very possibly a known bug in btrfs, that occurs even in raid1
> where a later scrub repairs all csum errors. While in theory btrfs raid1
> should simply pull from the mirrored copy if its first try fails checksum
> (assuming the second one passes, of course), and it seems to do this just
> fine if there's only an occasional csum error, if it gets too many at
> once, it *does* unfortunately crash, despite the second copy being
> available and being just fine as later demonstrated by the scrub fixing
> the bad copy from the good one.
>
> I'm used to dealing with that here any time I have a bad shutdown (and
> I'm running live-git kde, which currently has a bug that triggers a
> system crash if I let it idle and shut off the monitors, so I've been
> getting crash shutdowns and having to deal with this unfortunately often,
> recently). Fortunately I keep my root, with all system executables, etc,
> mounted read-only by default, so it's not affected and I can /almost/
> boot normally after such a crash. The problem is /var/log and /home
> (which has some parts of /var that need to be writable symlinked into
> /home/var, so / can stay read-only). Something in the normal after-crash
> boot triggers enough csum errors there that I often crash again.
>
> So I have to boot to emergency mode and manually mount the filesystems in
> question, so nothing's trying to access them until I run the scrub and
> fix the csum errors. Scrub itself doesn't trigger the crash, thankfully,
> and once it has repaired all the csum errors due to partial writes on one
> mirror that either were never made or were properly completed on the
> other mirror, I can exit emergency mode and complete the normal boot (to
> the multi-user default target). As there are no more csum errors by then,
> because scrub fixed them all, the boot doesn't crash due to too many such
> errors, and I'm back in business.
>
>
> Tho I believe at least the csum bug that affects me may only trigger if
> compression is (or perhaps has been in the past) enabled. Since I run
> compress=lzo everywhere, that would certainly affect me. It would also
> explain why the bug has remained around for quite some time as well,
> since presumably the devs don't run with compression on enough for this
> to have become a personal itch they needed to scratch, thus its remaining
> untraced and unfixed.
>
> So if you weren't using the compress option, your bug is probably
> different, but either way, the whole thing about too many csum errors at
> once triggering a system crash sure does sound familiar, here.
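The recovery sequence Duncan describes above (mount the affected
filesystems by hand from emergency mode, scrub them, then continue the
boot) could be scripted roughly as follows. This is only a sketch in
Python: the device path and mount point are placeholders, compress=lzo is
the option mentioned in this thread, and the function names are made up
for illustration. `btrfs scrub start -B` keeps the scrub in the foreground,
and `btrfs device stats` prints the per-device error counters afterwards.

```python
import shlex
import subprocess


def scrub_recovery_commands(device, mountpoint, options="compress=lzo"):
    """Build the commands for a manual mount followed by a foreground scrub.

    device and mountpoint are placeholders supplied by the caller.
    """
    return [
        # Mount by hand so nothing else touches the filesystem first.
        ["mount", "-o", options, device, mountpoint],
        # -B: do not background; return only once the scrub has finished.
        ["btrfs", "scrub", "start", "-B", mountpoint],
        # Show per-device error counters once the scrub is done.
        ["btrfs", "device", "stats", mountpoint],
    ]


def run_recovery(device, mountpoint, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    lines = []
    for cmd in scrub_recovery_commands(device, mountpoint):
        line = shlex.join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
    return lines
```

Calling `run_recovery("/dev/sdX1", "/home")` only prints the three
commands; passing `dry_run=False` (as root) would actually run them.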
Yes, I was running the compress=lzo option as well... Maybe here lies a
common problem?
Hmm… I found this thread via a reference on the Debian wiki page on
BTRFS¹.
I have used compress=lzo on BTRFS RAID 1 since April 2014 and have never
found an issue. Steven, your filesystem wasn't RAID 1 but RAID 5 or 6?
Yes, I was using RAID6 - and it has had a track record of eating data.
There are lots of problems with the implementation / correctness of
RAID5/6 parity - which I'm pretty sure haven't been nailed down yet. The
recommendation at the moment is just not to use RAID5 or RAID6 modes of
BTRFS. The last I heard, if you were using RAID5/6 in BTRFS, the
recommended action was to migrate your data to a different profile or a
different FS.
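For the "migrate to a different profile" route, the usual tool is a
balance with convert filters. A minimal sketch in Python that only builds
the command is below; raid1 as the target is an assumption (the advice
above does not name a profile), the mount point is a placeholder, and the
helper name is invented for illustration.

```python
def convert_profile_command(mountpoint, data="raid1", metadata="raid1"):
    """Build a `btrfs balance` command converting data and metadata profiles.

    The raid1 defaults are an assumption; pick whatever profile suits the
    number of devices in the filesystem.
    """
    for profile in (data, metadata):
        # The point of the exercise is to move away from the parity profiles.
        if profile in ("raid5", "raid6"):
            raise ValueError("target profile should not be raid5/raid6")
    return [
        "btrfs", "balance", "start",
        f"-dconvert={data}",      # convert data block groups
        f"-mconvert={metadata}",  # convert metadata block groups
        mountpoint,
    ]
```

For example, `convert_profile_command("/mnt/data")` yields the argument
list for `btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data`,
which can then be passed to `subprocess.run` as root.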
I just want to assess whether compress=lzo might be dangerous in my
setup. Right now I would like to keep using it, since I think at least
one of the SSDs does not compress. And… well… /home and /, where I use
it, are both quite full already.
I don't believe the compress=lzo option by itself was a problem - but it
*may* have an impact on the RAID5/6 parity problems? I'd be guessing
here, but am happy to be corrected.
--
Steven Haigh
Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897