TL;DR: thanks, but recovery is still preferred over recreation.

Hello Duncan and thanks for your reply!

On 10/26/2015 09:31 PM, Duncan wrote:
> FWIW... Older btrfs userspace such as your v3.17 is "OK" for normal
> runtime use, assuming you don't need any newer features, since in
> normal runtime it's the kernel code doing the real work, and userspace
> for the most part simply makes the appropriate kernel calls to do that
> work.
>
> But once you get into a recovery situation like the one you're in now,
> current userspace becomes much more important, as the various things
> you'll do to attempt recovery rely far more on userspace code directly
> accessing the filesystem, and it's only the newest userspace code that
> has the latest fixes.
>
> So for a recovery situation, the newest userspace release (4.2.2 at
> present) as well as a recent kernel is recommended, and depending on
> the problem, you may at times need to run the integration branch or
> apply patches on top of that.

I am willing to update before trying further repairs. Is e.g. "balance" also influenced by the userspace tools, or does the kernel do the actual work?
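For concreteness, I mean a plain rebalance started via the userspace tool, roughly like this (the mountpoint is just a placeholder):

  btrfs balance start /mnt/data
  btrfs balance status /mnt/data

i.e. whether an updated btrfs-progs changes anything about how such an operation behaves, or whether the tool merely triggers the corresponding kernel calls.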

> General note about btrfs and btrfs raid.  Given that btrfs itself
> remains a "stabilizing, but not yet fully mature and stable
> filesystem", while btrfs raid will often let you recover from a bad
> device, sometimes that recovery is in the form of letting you mount ro,
> so you can access the data and copy it elsewhere, before blowing away
> the filesystem and starting over.

If there is one subvolume that contains all other (read-only) snapshots and there is insufficient storage to copy them all separately: is there an elegant way to preserve those when moving the data across disks?
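Something like incremental send/receive is what I have in mind. A rough sketch, assuming the read-only snapshots live under /mnt/old and the new disk is mounted at /mnt/new (all paths and snapshot names are placeholders):

  btrfs send /mnt/old/snap-a | btrfs receive /mnt/new/
  btrfs send -p /mnt/old/snap-a /mnt/old/snap-b | btrfs receive /mnt/new/

The -p parent should let the later snapshots be transferred incrementally, so extents shared between them would not be duplicated on the target.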

> Back to the problem at hand.  Current btrfs has a known limitation when
> operating in degraded mode.  That being, a btrfs raid may be write-
> mountable only once, degraded, after which it can only be read-only
> mounted.  This is because under certain circumstances in degraded mode,
> btrfs will fall back from its normal raid mode to single-mode chunk
> allocation for new writes, and once there are single-mode chunks on the
> filesystem, btrfs mount isn't currently smart enough to check that all
> chunks are actually available on the present devices, and simply jumps
> to the conclusion that there are single-mode chunks on the missing
> device(s) as well, so it refuses to mount writable after that, in order
> to prevent further damage to the filesystem and preserve the ability to
> mount at least ro, to copy off what isn't damaged.
>
> There's a patch in the pipeline for this problem that checks individual
> chunks instead of leaping to conclusions based on the presence of
> single-mode chunks on a degraded filesystem with missing devices.  If
> that's your only problem (which the backtraces might reveal but I as a
> non-dev btrfs user can't tell), the patches should let you mount
> writable.

Interesting, thanks for the insights.
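If I understand the limitation correctly, the observable behaviour would be roughly the following (device and mountpoint are placeholders):

  # first degraded read-write mount succeeds; single-mode chunks may be created
  mount -o degraded /dev/sdb /mnt
  umount /mnt
  # subsequent read-write attempts are then refused ...
  mount -o degraded /dev/sdb /mnt
  # ... while read-only still works
  mount -o degraded,ro /dev/sdb /mnt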

> But that patch isn't in kernel 4.2.  You'll need at least kernel
> 4.3-rc, and possibly btrfs integration, or to cherrypick the patches
> onto 4.2.

Well, before digging into that, a hint that this is actually the case would be appreciated. :)
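If it comes to that, I assume cherry-picking onto 4.2 would look roughly like this (the branch name is mine, and I would still have to look up the actual commit ids of the patch series):

  git checkout -b btrfs-degraded-fix v4.2
  git cherry-pick <commit-id-of-each-patch>
  make olddefconfig && make -j"$(nproc)" && make modules_install install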

> Meanwhile, in keeping with the admin's rule on backups: if you valued
> the data more than the time and resources necessary for a backup, then
> by definition you have a backup available; otherwise, by definition,
> you valued the data less than the time and resources necessary to back
> it up.
>
> Therefore, no worries.  Regardless of the fate of the data, you saved
> what your actions declared most valuable to you, either the data, or
> the hassle and resource cost of the backup you didn't do.  As such, if
> you don't have a backup (or if you do but it's outdated), the data at
> risk of loss is by definition of very limited value.
>
> That said, it appears you don't even have to worry about loss of that
> very-limited-value data, since mounting degraded,recovery,ro gives you
> stable access to it, and you can use the opportunity provided to copy
> it elsewhere, at least to the extent that the data we already know is
> of limited value is even worth the hassle of doing that.
>
> Which is exactly what I'd do.  Actually, I've had to resort to btrfs
> restore[1] a couple times when the filesystem wouldn't mount at all, so
> the fact that you can mount it degraded,recovery,ro already puts you
> ahead of the game. =:^)
>
> So yeah, first thing, since you have the opportunity, unless your
> backups are sufficiently current that it's not worth the trouble, copy
> off the data while you can.
>
> Then, unless you wish to keep the filesystem around in case the devs
> want to use it to improve btrfs' recovery system, I'd just blow it away
> and start over, restoring the data from backup once you have a fresh
> filesystem to restore to.  That's the simplest and fastest way to a
> fully working system once again, and it's what I did here after using
> btrfs restore to recover the delta between the current state and my
> backups.

Thanks for all the elaborations. I guess there are also other valid definitions of making backups out there, ones that determine the amount and types of redundancy by additionally taking factors like the anticipated risk or the severity of a failure into consideration.

However, you are perfectly correct with your advice to create/update/verify all backups while it is (still) possible.
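Concretely, the plan is something along these lines (device name, mountpoint and backup target are placeholders):

  mount -o degraded,recovery,ro /dev/sdb /mnt
  rsync -aHAX /mnt/ /backup/btrfs-rescue/

For the read-only snapshots themselves I'd rather use send/receive as sketched above, since a plain copy would unshare their extents and not fit on the available storage.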

Besides that, I'd still be willing to try to recover the filesystem and to provide additional information to the devs.

Cheers,

Lukas