On Wed, Aug 28, 2019 at 07:21:14PM -0700, Sean Greenslade wrote:
> On August 28, 2019 5:51:02 PM PDT, Marc Oggier <marc.ogg...@megavolts.ch> 
> wrote:
> >Hi All,
> >
> >I am currently building a small data server for an experiment.
> >
> >I was wondering if the spare volume feature introduced a couple of
> >years ago (https://patchwork.kernel.org/patch/8687721/) would be
> >released soon. I think it would be awesome to have a drive installed
> >that can be used as a spare if one drive of an array dies, to avoid
> >downtime.
> >
> >Does anyone have news about it, and when it will officially be in the
> >kernel/btrfs-progs?
> >
> >Marc
> >
> >P.S. It took me a long time to switch to btrfs. I did it less than a
> >year ago, and I love it. Keep up the great work, y'all!
> 
> I've been thinking about this issue myself, and I have an (untested)
> idea for how to accomplish something similar. My file server has three
> disks in a btrfs raid1. I added a fourth disk to the array as just a
> normal, participating disk. I keep an eye on the usage to make sure
> that I never exceed three disks' worth of usage. That way, if one disk
> dies, there are still enough disks to mount RW (though I may still
> need to do an explicit degraded mount, not sure). In that scenario, I
> can just trigger an online full balance to rebuild the missing raid
> copies on the remaining disks. In theory, minimal to no downtime.
> 
> I'm curious if anyone can see any problems with this idea. I've never
> tested it, and my offsite backups are thorough enough to survive
> downtime anyway.
> 
> --Sean

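For the usage-monitoring part of that idea, the relevant numbers show up
with something like this (the mount point is just an example):

  # overall allocation vs. raw capacity, broken down per profile
  $ btrfs filesystem usage /mnt/data

  # per-disk breakdown of allocated vs. unallocated space
  $ btrfs device usage /mnt/data

The thing to keep an eye on, roughly, is that the chunks allocated on
any single disk could still be re-mirrored into the unallocated space
left on the others.
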
I decided to do a bit of experimentation to test this theory. The
primary goal was to see if a filesystem could suffer a failed disk and
have that disk removed and its data rebalanced onto the remaining disks
without the filesystem losing data or going read-only. Tested on kernel
5.2.5-arch1-1-ARCH with btrfs-progs v5.2.1.
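
For anyone who wants to reproduce it, the sequence looks something like
the following. Device names are purely illustrative, and the SCSI remove
hook (needs root) is only one of several ways to yank a disk out from
under the kernel:

  # four scratch disks, raid1 for both data and metadata
  $ mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde
  $ mount /dev/sdb /mnt/test

  # simulate a dead disk by dropping one device from the SCSI layer
  $ echo 1 > /sys/block/sde/device/delete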

I was actually quite impressed. When I ripped one of the block devices
out from under btrfs, the kernel started spewing tons of BTRFS errors,
but seemed to keep on trucking. I didn't leave it in this state for too
long, but I was reading, writing, and syncing the fs without issue.
After performing a btrfs device delete <MISSING_DEVID>, the filesystem
rebalanced and stopped reporting errors. Looks like this may be a viable
strategy for high-availability filesystems, assuming you have adequate
monitoring in place to catch disk failures quickly. I personally
wouldn't want to fully automate the disk deletion, but it's certainly
possible.
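
For monitoring, the per-device error counters are probably the most
useful signal, and the removal itself is a single command. The devid
and mount point below are illustrative:

  # per-device read/write/corruption error counters -- watch these
  $ btrfs device stats /mnt/test

  # identify the devid of the failed/missing disk
  $ btrfs filesystem show /mnt/test

  # drop the dead disk; its chunks get re-replicated onto the survivors
  # ("missing" is also accepted in place of a devid for an absent disk)
  $ btrfs device delete <MISSING_DEVID> /mnt/test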

--Sean
