On 5/8/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> And how is flŭd going?
Thanks for asking! I've been having a lot of fun with it. It has certainly taken me longer to get to where it is now than it would have if I could devote myself to it full-time, but that has been good in a way -- I've had a lot of time to really think through its design. It is getting very close to being ready for another release that will be a bit less uber-alpha. :)
> I like the documentation and blog, especially this entry: [1].
>
> [1] http://flud.org/blog/2007/04/26/eradicating-service-outages-once-and-for-all/
Thanks. I'm sure it's all preaching to the choir -- old hat stuff to the p2p-hackers crowd, but it was largely inspired by personal experiences I've had with both using and trying to maintain centralized services. The sad truth is, I fear, that unnoticed service outages and/or data loss occur in centrally managed architectures far more frequently than is reported.
> And I love your bibliography [2], especially the "Security / Robustness / Resilience" section.
>
> [2] http://www.flud.org/wiki/index.php/RelatedPapers
I hope it's a useful resource. The game theory stuff and fairness-enforcing mechanisms (http://www.flud.org/wiki/index.php/Fairness) are areas I get a bit excited about, and it is rewarding to finally have enough pieces in place in flŭd to start really making that stuff work.
> You're right on all counts. Allmydata is in the business of consumer backup, but as we design Tahoe, we're alert for opportunities to make it more flexible so that it can be extended to other purposes. So far this has worked well -- the way that we designed for alacrity and streaming, for example, hasn't caused any problems for "batch mode" backup and restore as far as I can tell.
It's always a series of tradeoffs with these things. In contrast to Tahoe, with flŭd I decided to focus the architecture squarely on backup. The common case for backup is writing data to the grid, which is the opposite of the common case for file sharing. In a pure backup system, an individual user restores backed-up data very rarely, but sends data to the backup system throughout the day, every day, all the time. And on those rare occasions when a user does need to restore data, they are going to want a full copy of their file(s). I suppose I'm claiming that features like partial-file streaming are going to be of little value in that scenario, and if they impose unnecessary costs, could actually be a net negative. In light of Tahoe's more generic goals, it sounds like you believe these potential negatives are more than offset by the additional flexibility and the future opportunity to leverage the architecture. Is that a fair assessment?

To me it has always seemed useful to have distinct decentralized storage networks tuned for different usage scenarios. BitTorrent has probably influenced me a bit in that thinking; you'd never use BitTorrent for redundant archiving (unless you have permanently popular content ;) ), but what BT does well is exactly what it has focused on: swarming cooperative download. "Grid Computing," which the press fawned over for some time, has also influenced my thinking, but in the opposite direction: there you have a set of super-generic distributed storage technologies, and yet they don't really seem to have caught on with anyone other than their big corporate creators. Maybe I have exaggerated the effect, but it seems like focused specialization benefits the backup scenario.

Having said all that, I will admit that having a single storage technology that can be applied to multiple uses is attractive, which is why I'm interested in how you deal with those tradeoffs and the related performance implications. In other words, if it is possible to create separate decentralized storage systems that are tuned to specific usage scenarios, and in those scenarios they yield significant (e.g., >2x) performance advantages over the generic architecture, couldn't that be a problem for Tahoe? If, on the other hand, such specialization yields no or only moderate improvement (e.g., 1.1x), then it seems that an approach like Tahoe's clearly wins the day, because absolute performance isn't everything: lots of high-performance technologies have been marginalized by lower-performing competitors, and "slightly better" is usually more than offset by "cheaper," "better marketed," and/or "popular + network effects."
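(To put rough numbers behind "back up constantly, restore rarely" -- every figure below is a made-up illustrative assumption, not a measurement of flŭd or Tahoe -- the asymmetry is striking even before you account for differing encode vs. decode costs:)

    # Back-of-envelope model of a pure backup workload.
    # All numbers are illustrative assumptions, not measurements.
    files_backed_up_per_day = 100      # steady daily backup traffic
    full_restores_per_year = 2         # restores are rare events
    encode_cost = 1.0                  # relative CPU cost to encode one file
    decode_cost = 1.0                  # relative CPU cost to decode one file

    daily_encode_work = files_backed_up_per_day * encode_cost
    daily_decode_work = (full_restores_per_year / 365.0) * decode_cost

    # Even with equal per-file costs, encode work outweighs decode work by
    # roughly four orders of magnitude, so shaving encode time pays off
    # every single day while decode time is almost never on the clock.
    print(daily_encode_work / daily_decode_work)   # ~18250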
> I see that you are using LDPC: [3]. I would be interested to see how it performs compared to my zfec library [4].
I've found that once a design problem convinces you to make a focused choice like the one I made with flŭd, it takes you down a certain road. LDPC vs. RS was one of those choices that became very clear once I decided that flŭd was focused only on the backup use case, where encode-and-upload time dominates download-and-decode time. I don't remember off-hand by what factor LDPC is faster than Rizzo's Reed-Solomon for encoding, but when I was evaluating, it was several :) . Decoding is similarly faster, but the tradeoff is that for decode, LDPC might need to recover an extra chunk (or several). That seems like a very decent tradeoff, especially given that we can choose parameters to make guarantees like "you'll only need to recover 25 of 50 chunks to rebuild the file." This is magnified by the observation above that users' daily experience will include snappier backups and friendlier CPU burnage.

The other thing that LDPC does for flŭd is provide efficient memory operations even over very large files, without segmenting them. Whether you're encoding a 1K file or a 2GB file, each encodes to N+M data+parity blocks. (It is possible to do this with RS, even with a 2^8 Galois field; it just takes a lot more computation.) This simplifies the per-file metadata requirements and makes it possible to have a small, fixed-size hashlist for every file, which is relevant to the Merkle tree vs. hash list discussion you and David have been having. For Tahoe, which has more generic goals, maybe this is not an advantage (since you want to chunk files into segments for reasons other than satisfying the encoder), but for flŭd it is (since it simplifies the design and improves performance).

I'll have to play with zfec. I've always been impressed with how much performance Rizzo was able to squeeze out of Reed-Solomon, and you've done a lot of people a great favor by providing a Python wrapper. I've been meaning to pull the LDPC wrapper (which needs some serious rewriting, btw) out of flŭd and make it a separate release for some time now. Hope you don't mind if I follow your example.

Alen
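P.S. In case it helps anyone compare notes on the RS side: here's roughly what a k-of-m round trip looks like with zfec, going by its documented Encoder/Decoder API. The parameters (k=3, m=5) and the zero-padding scheme are my own illustrative choices, so treat this as a sketch rather than gospel:

    # Minimal k-of-m Reed-Solomon round trip with zfec (sketch).
    # Assumes Encoder.encode() returns all m blocks by default and
    # Decoder.decode() takes any k blocks plus their block numbers.
    import zfec

    k, m = 3, 5                        # any k of the m blocks rebuild the file
    data = b"some file contents worth backing up"

    # Pad so the data splits into k equal-length primary blocks.
    blocksize = (len(data) + k - 1) // k
    padded = data.ljust(blocksize * k, b"\x00")
    primary = [padded[i * blocksize:(i + 1) * blocksize] for i in range(k)]

    enc = zfec.Encoder(k, m)
    blocks = enc.encode(primary)       # k primary blocks + (m - k) parity blocks

    # Simulate losing two blocks; with RS, any k survivors suffice --
    # exactly k, unlike LDPC, which may need an extra chunk or several.
    survivors = [blocks[0], blocks[2], blocks[4]]
    blocknums = [0, 2, 4]

    dec = zfec.Decoder(k, m)
    recovered = b"".join(dec.decode(survivors, blocknums))
    assert recovered[:len(data)] == data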
