On Mon, Sep 20, 2010 at 10:14 AM, Shawn Willden <sh...@willden.org> wrote:
>
> But Tahoe isn't really optimized for large grids. I'm not
> sure how big the grid has to get before the overhead of all of the
> additional queries to place/find shares begin to cause significant
> slowdowns, but based on Zooko's reluctance to invite a lot more people
> into the volunteer grid (at least, I've perceived such a reluctance),
> I suspect that he doesn't want too many more than the couple of dozen
> nodes we have now.
The largest Tahoe-LAFS grid that has ever existed, as far as I know, was the allmydata.com grid. It had about 200 storage servers, each of which was a userspace process with exclusive access to one spinning disk. Each disk was 1.0 TB, except for a few that were 1.5 TB, so the total raw capacity of the grid was a little more than 200 TB. It used K=3, M=10 erasure coding, so the cooked capacity was around 60 TB. Most of the machines were 1U servers with four disks, a few were 2U servers with six, eight, or twelve disks, and one was a Sun Thumper with thirty-six disks. (Sun *gave* us that Thumper, which was only slightly out of date at the time, just because we were a cool open source storage startup. Those were the days.)

Based on that experience, I'm sure that Tahoe-LAFS scales up to at least 200 nodes as far as the performance of queries to place/find shares is concerned. (Actually, those queries have been significantly optimized since then by Kevan and Brian for both upload and download of mutable files, so modern Tahoe-LAFS should perform even better in that setting.)

However, I'm also sure that Tahoe-LAFS *failed* to scale up in a different way, and that other failure is why I jealously guard the secret entrance to the Volunteer Grid from passersby.

The way that it failed to scale up was like this: suppose you use K=3, H=7, M=10 erasure coding. Then the more nodes there are in the system, the more likely it is to incur a simultaneous outage of 5 different nodes (H-K+1), which *might* render some files and directories unavailable. (That's because some files or directories might be on only H=7 different nodes. The failure of 8 nodes (M-K+1) will *definitely* render some files or directories unavailable.) In the allmydata.com case, this could happen due to the simultaneous outage of any two of the 1U servers, or any one of the bigger servers, for example [*]. In a friendnet such as the Volunteer Grid, this would happen if we had enough servers that occasionally their normal level of unavailability would coincide on five or more of them at once.

Okay, that doesn't sound too good, but it isn't that bad. You could say to yourself that at least the rate of unavailable or even destroyed files, expressed as a fraction of the total number of files that your grid is serving, should be low.

*But* there is another design decision that mixes with this one to make things really bad. That is: a lot of maintenance operations like renewing leases and checking-and-repairing files, not to mention retrieving your files for download, work by traversing through directories stored in the Tahoe-LAFS filesystem. Each Tahoe-LAFS directory (which is stored in a Tahoe-LAFS file) is independently, randomly assigned to servers.

See the problem? If you scale up the size of your grid in terms of servers *and* the size of your filesystem in terms of how many directories you have to traverse through in order to find something you want, then you will eventually reach a scale where all or most of the things that you want are unreachable all or most of the time. This is what happened to allmydata.com, when their load grew and their financial and operational capacity shrank, so that they couldn't replace dead hard drives, add capacity, or run deep-check-and-repair and deep-add-lease.

I want to emphasize the scalability aspect of this problem. Sure, anyone who has tight finances and high load can have problems, but in this case the problem was worse the larger the scale.
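(To make the compounding concrete, here is a back-of-the-envelope sketch. It is only an illustration of the argument above, not code from Tahoe-LAFS, and it makes two simplifying assumptions: each file's shares land on exactly H distinct servers, which is the worst case that Servers of Happiness permits, and each server is independently reachable with probability p.)

    # Illustration only -- not Tahoe-LAFS code.
    # A file whose shares sit on exactly h servers is retrievable if at
    # least k of those servers are up.  A path through the filesystem is
    # retrievable only if every directory along it (each directory being
    # a file that is independently placed on its own h servers) is
    # retrievable too.
    from math import comb

    def file_availability(k, h, p):
        # P(at least k of h independently-reachable servers are up)
        return sum(comb(h, i) * p**i * (1 - p)**(h - i)
                   for i in range(k, h + 1))

    def path_availability(k, h, p, depth):
        # 'depth' directories to traverse, plus the file itself
        return file_availability(k, h, p) ** (depth + 1)

    for depth in (1, 10, 100, 1000):
        print(depth, path_availability(k=3, h=7, p=0.9, depth=depth))

With p=0.9 the fall-off is gentle, but with servers that are offline more often, say p=0.7, a path through a hundred directories succeeds only a few percent of the time. That is roughly the situation allmydata.com is in, now that it can't keep most of those disks spinning.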
All of the data that its customers entrusted to it is still in existence on those 200 disks, but to access *almost any* of that data requires getting *almost all* of those disks spinning at the same time, which is beyond allmydata.com's financial and operational capacity right now.

So, I have to face the fact that we designed a system that is fundamentally non-scalable in this way. Hard to swallow.

On the bright side, writing this letter has shown me a solution! Set M >= the number of servers on your grid (while keeping K/M the same as it was before). So if you have 100 servers on your grid, set K=30, H=70, M=100 instead of K=3, H=7, M=10! Then there is no small set of servers which can fail and cause any file or directory to fail.

With that tweak in place, Tahoe-LAFS scales up to about 256 separate storage server processes, each of which could have at least 2 TB (a single SATA drive), or an arbitrarily large filesystem if you give it something fancier like RAID, ZFS, a SAN, etc. That's pretty scalable!

Please learn from our mistake: we were originally thinking of the beautiful combinatorial math that shows you that with K=3, M=10, and a "probability of success of one server" of 0.9, you get some wonderful "many 9's" fault tolerance. This is what Nassim Nicholas Taleb in "The Black Swan" calls the Ludic Fallacy. The Ludic Fallacy is to think that what you really care about can be predicted with a nice, simple mathematical model.

Brian, with a little help from Shawn, developed some slightly more complex mathematical models, such as this one that comes with Tahoe-LAFS: http://pubgrid.tahoe-lafs.org/reliability/

(Of course mathematical models are a great help for understanding. It isn't a fallacy to use them; it is a fallacy to think that they are a sufficient guide to action.)

It would be a good exercise to understand how the allmydata.com experience fits into that "/reliability/" model, or doesn't fit into it. I'm not sure, but I think that model must fail to account for the issues raised in this letter, because that model is "scale-free": there is no input to the model to specify how many hard drives or how many files there are (except indirectly in check_period, which has to grow as the number of files grows). But my argument above persuades me that the architecture is not scale-free: if you add enough hard drives (exceeding your K, H, M parameters) and enough files, then it is guaranteed to fail.

Okay, I have more to say in response to Shawn's comments about grid management, but I think this letter is in danger of failing due to having scaled up during the course of this evening. ;-) So I will save some of my response for another letter.

Regards,

Zooko

http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1199# document known scaling issues

[*] Historical detail: allmydata.com was before Servers of Happiness, so they had it even worse: a file might have had all of its shares bunched up onto a single disk.
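P.S. For anyone who wants to try the one-share-per-server idea above: the encoding parameters are set per client in tahoe.cfg. What follows is just a sketch for a hypothetical 100-server grid, using the standard [client] option names; check the documentation for the version you are running.

    [client]
    # one share per server on a 100-server grid; no set of fewer than
    # M-K+1 = 71 servers can make a file or directory unavailable
    shares.needed = 30
    shares.happy = 70
    shares.total = 100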