On 2016-09-11 09:02, Hugo Mills wrote:
On Sun, Sep 11, 2016 at 02:39:14PM +0200, Waxhead wrote:
Martin Steigerwald wrote:
On Sunday, 11 September 2016 at 13:43:59 CEST, Martin Steigerwald wrote:
Thing is: this just seems to be a matrix of when a feature was implemented,
not of when it is considered stable. I think this could be done with
colors or so, like red for not supported, yellow for implemented and
green for production ready.
Exactly, just like the Nouveau matrix. It clearly shows what you can
expect from it.
I mentioned this matrix as a good *starting* point. And I think it would be
easy to extend it:

Just add another column called "Production ready". Then research / ask about
production stability of each feature. The only challenge is: Who is
authoritative on that? I'd certainly ask the developer of a feature, but I'd
also consider user reports to some extent.

Maybe that's the real challenge.

If you wish, I'd go through each feature there and give my own estimation. But
I think there are others who are deeper into this.
That is exactly why I don't edit the wiki myself. I could of course get
it started and hope that someone corrects what I write, but I feel that
I don't have deep enough knowledge to make a proper start of it.
Perhaps I will change my mind about this.

   Given that nobody else has done it yet, what are the odds that
someone else will step up to do it now? I would say that you should at
least try. Yes, you don't have as much knowledge as some others, but
if you keep working at it, you'll gain that knowledge. Yes, you'll
probably get it wrong to start with, but you probably won't get it
*very* wrong. You'll probably get it horribly wrong at some point, but
even the more knowledgeable people you're deferring to didn't identify
the problems with parity RAID until Zygo and Austin and Chris (and
others) put in the work to pin down the exact issues.
FWIW, here's a list of what I personally consider stable (as in, I'm willing to bet against reduced uptime to use this stuff on production systems at work and on personal systems at home):
1. Single device mode, including DUP data profiles on a single device without mixed-bg.
2. Multi-device raid0, raid1, and raid10 profiles with symmetrical devices (all devices are the same size).
3. Multi-device single profiles with asymmetrical devices.
4. Small numbers (double digits at most) of snapshots, taken at infrequent intervals (no more than once an hour). I use single snapshots regularly to get stable images of the filesystem for backups, and I keep hourly ones of my home directory for about 48 hours.
5. Subvolumes used to isolate parts of a filesystem from snapshots. I use this regularly to isolate areas of my filesystems from backups.
6. Non-incremental send/receive (no clone source, no parents, no deduplication). I use this regularly for cloning virtual machines; a rough command sketch follows after this list.
7. Checksumming and scrubs using any of the profiles I've listed above.
8. Defragmentation, including autodefrag.
9. All of the compat_features, including no-holes and skinny-metadata.
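
For reference, most of the items above map directly onto btrfs-progs commands. Here's a minimal sketch of items 2, 4 and 6; the device names, paths and dates are placeholders, not anything from my actual setup:

    # Item 2: two equal-sized devices, raid1 for both data and metadata
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

    # Item 4: a read-only snapshot as a stable image for backups
    btrfs subvolume snapshot -r /home /home/.snapshots/home-2016-09-11

    # Item 6: non-incremental send/receive of that snapshot to another
    # btrfs filesystem (no -p/-c options, so no parent or clone source)
    btrfs send /home/.snapshots/home-2016-09-11 | btrfs receive /mnt/backup

Item 7 is then just a matter of running 'btrfs scrub start' against the mount point from time to time.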

Things I consider stable enough that I'm willing to use them on my personal systems but not on systems at work:
1. In-line data compression with compress=lzo. I use this on my laptop and home server. I've never had any issues with it myself, but I know that other people have, and it does seem to make other things more likely to have issues.
2. Batch deduplication. I only use this on the back-end filesystems for my personal storage cluster, and only because I have multiple copies as a result of GlusterFS on top of BTRFS. I've not had any significant issues with it, and I don't remember any reports of data loss resulting from it, but it's something people should not be using if they don't understand all the implications. (A short command sketch of both items follows this list.)
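
To make those two concrete: compression is just a mount option, and batch deduplication is normally driven by an external tool such as duperemove. A rough sketch, with placeholder device and paths, and with the caveat that duperemove's options may differ between versions:

    # Item 1: lzo compression for data written after the option is in effect
    mount -o compress=lzo /dev/sdb /srv/data

    # Item 2: batch deduplication over an existing directory tree
    duperemove -dr /srv/data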

Things that I don't consider stable but some people do:
1. Quotas and qgroups. Some people (such as SUSE) consider these stable. There are still a couple of known issues with them, however, such as returning the wrong errno when a quota is hit (it should return -EDQUOT, but instead returns -ENOSPC); see the sketch after this list.
2. RAID5/6. A few people use this, but it's generally agreed to be unstable. There are still at least 3 known bugs which can cause complete loss of a filesystem, and there's also a known issue with rebuilds taking insanely long, which puts data at risk as well.
3. Multi-device filesystems with asymmetrical devices running raid0, raid1, or raid10. The issue I have here is that it's much easier to hit errors regarding free space than it should be on a reliable system. It's possible to avoid this with careful planning (for example, a 3-disk raid1 profile with one disk exactly twice the size of the other two will work fine, albeit with more load on the larger disk).
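
For the qgroup item, enabling quotas is a per-filesystem switch plus optional per-subvolume limits. A minimal sketch with placeholder paths and a made-up 10G limit; exceeding that limit is exactly where the wrong errno shows up today:

    # turn on quota tracking, cap one subvolume, then inspect usage
    btrfs quota enable /srv/data
    btrfs qgroup limit 10G /srv/data/subvol
    btrfs qgroup show /srv/data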

There's probably some stuff I've missed, but that should cover most of the widely known features. The problem ends up being that what counts as 'stable' depends a lot on who you ask. SUSE obviously considers qgroups stable (they're enabled by default in all current SUSE distributions), but I wouldn't be willing to use them, and I'd be willing to bet most of the developers wouldn't either.

As far as what I consider stable goes, I've been using just about everything in the first two lists above for the past year or so with no issues that were due to BTRFS itself (I've had some hardware issues, but BTRFS actually saved my data in those cases). I'm not a typical user, though, both in terms of use cases (I use LVM for storing VM images and then set ACLs on the device nodes so I can use them as a regular user, and I do regular maintenance on all the databases on my systems) and in terms of relative knowledge of the filesystem (I've fixed BTRFS filesystems by hand with a hex editor before; not something I ever want to do again, but I know I can do it if I need to), and both of those affect my confidence in using some features.
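
(As an aside, the LVM-plus-ACL arrangement above is just a matter of granting an unprivileged user access to the logical volume's device node; a sketch with made-up volume group and user names:

    # let user 'vmuser' read and write the VM's backing device directly
    setfacl -m u:vmuser:rw /dev/vg0/vm-disk0
    getfacl /dev/vg0/vm-disk0
)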

   So I'd strongly encourage you to set up and maintain the stability
matrix yourself -- you have the motivation at least, and the knowledge
will come with time and effort. Just keep reading the mailing list and
IRC and bugzilla, and try to identify where you see lots of repeated
problems, and where bugfixes in those areas happen.
Exactly this. Most people don't start working on something for the first time with huge amounts of preexisting knowledge about it. Heaven knows I didn't, both when I first started using Linux and when I started using BTRFS. One of the big advantages of open source in this respect, though, is that you can usually find people willing to help you without much effort, and the support is generally pretty good.

As far as documentation goes, though, we [BTRFS] really do need to get our act together. It really doesn't look good to have most of the best documentation in the distros' wikis instead of ours. I'm not saying the distros shouldn't be documenting BTRFS, but when Debian (for example) has better documentation of the upstream version of BTRFS than the upstream project itself does, that starts to look bad.

   So, go for it. You have a lot to offer the community.

   Hugo.

I do think, for example, that scrubbing and auto RAID repair are stable, except
for RAID 5/6. I also consider device statistics and RAID 0 and 1 to be stable.
I think RAID 10 is stable as well, but as I do not run it, I don't know. For me
skinny-metadata is also stable. So far even compress=lzo seems stable for me,
though for others it may not be.
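
For anyone who wants to check these on their own systems, the scrub and device statistics mentioned here are plain subcommands; the mount point below is a placeholder:

    btrfs scrub start -B /mnt/data    # -B stays in the foreground and prints a summary
    btrfs device stats /mnt/data      # per-device read/write/flush/corruption/generation error counters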

Since what kernel version? Now, there you go: I have no idea. All I know is
that I started using BTRFS with kernel 2.6.38 or 2.6.39 on my laptop, but not
as RAID 1 at that time.

See, the implementation time of a feature is much easier to assess. Maybe
that's part of the reason why there is no stability matrix: maybe no one
*exactly* knows *for sure*. How could you? So I would even put a footnote on
that "production ready" column explaining "Considered to be stable based on
developer and user opinions".

Of course it would additionally be good to read about experiences with
corporate usage of BTRFS. I know that at least Fujitsu, SUSE, Facebook and
Oracle are using it, but I don't know in what configurations and with what
experiences. One Oracle developer is investing a lot of time in bringing
BTRFS-like features to XFS, and Red Hat still favors XFS over BTRFS; even SLES
defaults to XFS for /home and other non-/ filesystems. That also tells a story.

You can get some ideas from the SUSE release notes. Even if you do not want to
use SUSE, they tell you something, and I bet they are one of the better sources
of information on your question that you can get at this time, because I
believe SUSE developers invested some time in assessing the stability of
features: they would carefully assess what they can support in enterprise
environments. There is also someone from Fujitsu who shared experiences in a
talk; I can search for the URL to the slides again.
By all means, SUSE's wiki is very valuable. I just said that I
*prefer* to have that stuff on the BTRFS wiki and feel that is the
right place for it.

I bet Chris Mason and the other BTRFS developers at Facebook have some idea of
what they use within Facebook as well. To what extent they are allowed to talk
about it… I don't know. My personal impression is that as soon as Chris went
to Facebook he became quite quiet. Maybe just due to being busy. Maybe due to
Facebook being much more concerned about its own privacy than about that of
its users.

Thanks,


