Mike Freedman pointed out that SET caught a bit of interest here, so I figured I'd answer a few of the questions that've come up. (I'm not on the list long-term, but will hang out for a sec if there are questions.)

He Yuan asked: "I think a central index server is critical in SET, which holds all the fingerprints?"

A: No. SET needs _something_ that maps chunk fingerprints -> file IDs. In practice, the best way to implement that is to build on top of an existing DHT mechanism. Our implementation can use either OpenDHT or a simple centralized server.
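To make that concrete, here's a minimal sketch of the fingerprint -> file-ID index SET needs. Everything here is illustrative: a plain dict stands in for the key->value service, which in a real deployment would be OpenDHT or a centralized server, and the names (`ChunkIndex`, `publish`, `lookup`) are mine, not SET's API.

```python
import hashlib

# Illustrative stand-in for the fingerprint -> file-ID mapping.
# A real deployment would back this with OpenDHT or a central server.
class ChunkIndex:
    def __init__(self):
        self._index = {}  # fingerprint -> set of file IDs

    def publish(self, fingerprint: bytes, file_id: str) -> None:
        # Record that file_id contains a chunk with this fingerprint.
        self._index.setdefault(fingerprint, set()).add(file_id)

    def lookup(self, fingerprint: bytes) -> set:
        # Which files contain a chunk with this fingerprint?
        return self._index.get(fingerprint, set())

def chunk_fingerprint(chunk: bytes) -> bytes:
    # SHA-1 of the chunk contents stands in for the chunk fingerprint.
    return hashlib.sha1(chunk).digest()

index = ChunkIndex()
index.publish(chunk_fingerprint(b"some chunk data"), "file-A")
index.publish(chunk_fingerprint(b"some chunk data"), "file-B")

# Looking up a chunk we already hold tells us which files contain it.
print(sorted(index.lookup(chunk_fingerprint(b"some chunk data"))))  # ['file-A', 'file-B']
```

Any DHT with put/get semantics can play the role of `ChunkIndex` here, which is why the implementation can swap between OpenDHT and a simple server.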

John Casey noted the LBFS citation. Yup - that's exactly the technique we use for splitting files up into chunks in SET so that we can find chunks even when there are insertions or deletions in the file. Our implementation uses their code for Rabin fingerprinting, in fact. As in all things, SET stands on the shoulders of giants - we didn't invent fingerprinting, we're just one of many systems that use it.
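For anyone who hasn't seen the LBFS trick, here's a toy version of content-defined chunking: a rolling hash over a small sliding window picks chunk boundaries from the data itself, so an insertion or deletion only disturbs the chunks near the edit. This uses a simple polynomial rolling hash as a stand-in for the real Rabin fingerprint, and the window/size parameters are illustrative, not SET's.

```python
import hashlib
import random

# Toy content-defined chunking in the LBFS style. The polynomial
# rolling hash below is a stand-in for a true Rabin fingerprint;
# all parameters are illustrative.
WINDOW = 16
BASE = 257
MOD = 1 << 31
DIVISOR = 64                      # expected average chunk size, in bytes
REMOVE = pow(BASE, WINDOW, MOD)   # coefficient of the byte leaving the window

def chunks(data: bytes, min_size: int = 32, max_size: int = 1024):
    out, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * BASE + data[i]) % MOD
        if i >= WINDOW:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - WINDOW] * REMOVE) % MOD
        size = i - start + 1
        # Cut where the hash hits a magic value (or at max_size).
        if size >= min_size and (h % DIVISOR == DIVISOR - 1 or size >= max_size):
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(42)
data = bytes(random.randrange(256) for _ in range(2000))
edited = data[:1900] + b"inserted" + data[1900:]

a = {hashlib.sha1(c).hexdigest() for c in chunks(data)}
b = {hashlib.sha1(c).hexdigest() for c in chunks(edited)}
print(len(a & b), "chunks survive the insertion")
```

Because boundaries depend only on local content, the chunks away from the edit keep the same fingerprints, which is exactly what lets SET match them across different files.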

Many people discussed whether SET can improve the speed of "legitimate" files. Two answers: a) I hope we'll see a future in which P2P is used to deliver completely legal multimedia content. To some degree, it's already there - lots of indie artists already release songs freely, and there's no reason to think that the same folks who post (legal) videos to YouTube wouldn't distribute them via p2p _if_ the p2p systems were as easy to use. And it can still improve one->many distribution. Imagine that Warner was distributing both the English and German versions of a movie. Our study found that the two files have substantial similarity. SET would allow a BitTorrent-like approach to distributing them, where the two different swarms could draw from each other for the similar content.

b) As many people have already noted, tons of past studies showed that there's substantial similarity in things like different versions of powerpoint presentations, code, software builds, email, and even web pages. We've got some pointers to these previous studies in the paper (http://www.cs.cmu.edu/~dga/dot/). We didn't focus on these in our work because, frankly, most of the bytes transferred via p2p these days _are_ multimedia, and to our knowledge, nobody had looked at the question of similarity between multimedia files.

SET can find the similarity in those files just as easily as in any others. We've tested it on things like Linux ISOs and RPMs, and it still works. A small caveat: for ISOs, we found that it's much more effective if we reduce the chunk size to 2KB, which adds an unpleasant amount of overhead. (Possible interaction with the media format block size?)
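To put rough numbers on that overhead caveat (these are back-of-the-envelope figures of mine, and the 16KB "default" chunk size is an assumption for illustration, not SET's documented value):

```python
# Back-of-the-envelope on the chunk-size caveat. Illustrative only:
# the 16 KB baseline is assumed, not SET's documented default.
iso_size = 700 * 2**20               # a typical CD-sized ISO, in bytes
for chunk_size in (16 * 2**10, 2 * 2**10):
    n_chunks = iso_size // chunk_size
    index_bytes = n_chunks * 20      # one 20-byte SHA-1 per chunk
    print(f"{chunk_size // 1024:>2} KB chunks: {n_chunks} fingerprints, "
          f"~{index_bytes / 2**20:.1f} MB of fingerprint metadata")
```

Shrinking chunks from 16KB to 2KB multiplies the number of fingerprints to compute, store, and look up by 8x for the same file, which is where the unpleasantness comes from.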

Justin Chapweske noted the relationship to things like Riverbed: The system is very different from Riverbed's. The basic idea of using fingerprinting isn't new, and it's not our contribution at all. The contribution that we're focusing on is using handprinting -- deterministic sampling of the fingerprints -- to efficiently locate _other peers_ who have similar files.
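Here's a sketch of the handprinting idea. Because the sample is a deterministic function of the fingerprints themselves, two peers whose files share many chunks will, with high probability, publish overlapping samples and so find each other via the lookup system. The "k smallest hashes" rule below is one standard deterministic sampler, used here purely for illustration - I'm not claiming it's SET's exact rule, and the chunk contents are made up.

```python
import hashlib

# Illustrative handprinting: deterministically sample k of a file's
# chunk fingerprints and publish only those. The "k smallest hashes"
# rule is a stand-in for SET's actual sampling rule.
def fingerprints(chunks):
    return sorted(hashlib.sha1(c).digest() for c in chunks)

def handprint(chunks, k=3):
    # The k smallest fingerprints - a deterministic function of the
    # chunk set, so similar files yield overlapping handprints.
    return set(fingerprints(chunks)[:k])

# Two versions of a "movie" sharing most chunks (hypothetical data).
file_en = [b"intro", b"scene1", b"scene2", b"scene3", b"credits-en"]
file_de = [b"intro", b"scene1", b"scene2", b"scene3", b"credits-de"]

# Overlapping handprints are what let a peer holding one file locate
# peers holding the other through a handful of DHT lookups.
print(len(handprint(file_en) & handprint(file_de)), "shared handprint entries")
```

The point is that each peer publishes O(k) entries instead of one per chunk, so locating similar _peers_ stays cheap even for large files; fetching the full chunk lists happens only after a match.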

Did I miss anything? Glad you found the work interesting. Note that we've released all of the source code for SET, but the UI is about what one would expect from an academic research project. :) (A command line with cryptic syntax.) We're happy to provide guidance if anyone wants to implement it in one of the popular P2P systems. The handprinting itself would be pretty easy to add to any system that already uses a DHT or similar key->value lookup; building on Rabin fingerprinting for insertion/deletion robustness would take bigger changes to most existing systems.

  -Dave


_______________________________________________
p2p-hackers mailing list
[EMAIL PROTECTED]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
