On Friday, April 13, 2007 David Andersen wrote:
> I'm not on the list long-term, but will hang out for a sec if there
> are questions.

        Actually, there are. Thank you for visiting!

        The biggest issue that I've seen mentioned so far seems to 
be this one: how much of the MP3 similarity is due to the different
metadata in the otherwise identical files?

        Did you try comparing your approach to one where the MP3 file
hash is calculated without the metadata, over the audio data block
alone? If files with the same data block but different metadata were
considered by a P2P system to be the same file (as they really should
be - just like identical files with different names), what would
happen to the transfer speed improvement numbers quoted in the
article?

        Sorry if I missed something in the article and that is exactly
how you came up with these speed improvement numbers in the first
place. What I'm getting at is this: I'm trying to figure out how much
of that speed increase can be gained 'easily' - without any
complicated code changes, just by switching to the proper hashing
technique.
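
        Just to make the question concrete, here is a rough sketch of
the kind of metadata-skipping hash I have in mind (hypothetical
Python that handles only the common ID3v2/ID3v1 tag placement, not
every variant in the wild):

    import hashlib

    def audio_data_hash(path):
        # Hash the MP3 audio data only, skipping the ID3 metadata.
        with open(path, "rb") as f:
            data = f.read()
        if data[:3] == b"ID3":
            # ID3v2: 10-byte header; bytes 6..9 hold the tag size as
            # a 28-bit "syncsafe" integer.
            size = ((data[6] << 21) | (data[7] << 14) |
                    (data[8] << 7) | data[9])
            data = data[10 + size:]
        if len(data) >= 128 and data[-128:-125] == b"TAG":
            # ID3v1: fixed 128-byte tag at the very end of the file.
            data = data[:-128]
        return hashlib.sha1(data).hexdigest()

        Two files that differ only in their tags would then hash to
the same value and be treated as the same file by the network.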

        Best wishes -
        S.Osokine.
        13 Apr 2007.


-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of David Andersen
Sent: Friday, April 13, 2007 9:31 PM
To: [EMAIL PROTECTED]
Subject: Re: [p2p-hackers] Computer scientists develop P2P system
that promises faster music, movie downloads


Mike Freedman pointed out that SET caught a bit of interest here, so  
I figured I'd answer a few of the questions that've come up.  (I'm  
not on the list long-term, but will hang out for a sec if there are  
questions.)

He Yuan asked:  "I think a central index server is critical in SET,  
which holds all the fingerprints?"

A:  No.  SET needs _something_ that maps chunk fingerprints -> file  
IDs.  In practice, the best way to implement that is to build on top  
of an existing DHT mechanism.  Our implementation can use either  
OpenDHT or a simple centralized server.
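
To make that concrete, here's a minimal sketch of the kind of mapping
SET needs - the ToyDHT is just an in-memory stand-in, and the names
are illustrative rather than our actual API:

    # Chunk fingerprint -> file ID mapping on top of any key->value
    # store with put/get (a DHT, or a single central server).
    class ToyDHT:
        def __init__(self):
            self.table = {}
        def put(self, key, value):
            self.table.setdefault(key, set()).add(value)
        def get(self, key):
            return self.table.get(key, set())

    def publish_file(dht, file_id, chunk_fingerprints):
        # Announce every chunk we hold under its fingerprint.
        for fp in chunk_fingerprints:
            dht.put(fp, file_id)

    def files_containing(dht, fp):
        # All files known to contain this chunk.
        return dht.get(fp)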

John Casey noted the LBFS citation.  Yup - that's exactly the  
technique we use for splitting files up into chunks in SET so that we  
can find chunks even when there are insertions or deletions in the  
file.  Our implementation uses their code for Rabin fingerprinting,  
in fact.  As in all things, SET stands on the shoulders of giants -  
we didn't invent fingerprinting, we're just one of many systems that  
use it.
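
For anyone who hasn't run into content-based chunking before, here's
a toy illustration of the idea - a simple Karp-Rabin-style rolling
hash standing in for the real Rabin polynomial code, just to show why
the chunk boundaries survive insertions and deletions:

    WINDOW = 48                # rolling-hash window, as in LBFS
    MASK = (1 << 13) - 1       # expected chunk size ~8KB
    BASE, MOD = 257, (1 << 61) - 1

    def chunk_boundaries(data):
        # Declare a boundary wherever the hash of the last WINDOW
        # bytes hits a magic value.  Boundaries move with the content,
        # so an insertion or deletion only disturbs nearby chunks.
        boundaries = []
        h = 0
        top = pow(BASE, WINDOW - 1, MOD)
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * top) % MOD  # drop oldest
            h = (h * BASE + b) % MOD                    # add newest
            if i >= WINDOW - 1 and (h & MASK) == MASK:
                boundaries.append(i + 1)  # chunk ends after byte i
        # (real implementations also enforce min/max chunk sizes)
        return boundaries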

Many people discussed whether SET can speed up the transfer of  
"legitimate" files.  Two answers:
   a)  I hope we'll see a future in which P2P is used to deliver  
completely legal multimedia content.  To some degree, we're already  
there - lots of indie artists already release their songs freely, and  
there's no reason to think that the same folks who post (legal)  
videos to YouTube wouldn't distribute them via p2p _if_ the p2p  
systems were as easy to use.  And SET can still improve one->many  
distribution.  Imagine that Warner were distributing both the English  
and German versions of a movie.  Our study found that the two files  
have substantial similarity.  SET would allow you to use a   
BitTorrent-like approach to distributing those, where the two  
different swarms could draw from each other for the similar content.

   b)  As many people have already noted, tons of past studies showed  
that there's substantial similarity in things like different versions  
of powerpoint presentations, code, software builds, email, and even  
web pages.  We've got some pointers to these previous studies in the  
paper (http://www.cs.cmu.edu/~dga/dot/).  We didn't focus on these in  
our work because, frankly, most of the bytes transferred via p2p  
these days _are_ multimedia, and to our knowledge, nobody had looked  
at the question of similarity between multimedia files.

SET can find the similarity in those files just as easily as in  
others.  We've tested it on things like Linux ISOs and RPMs, and it  
still works.  A small caveat is that for ISOs, we found that it's  
much more effective if we reduce the chunk size to 2KB, which adds an  
unpleasant amount of overhead.  (Possible interaction with the media  
format block size?)

Justin Chapweske noted the relationship to things like Riverbed:  The  
system is very different from Riverbed's.  The basic idea of using  
fingerprinting isn't new, and it's not our contribution at all.  The  
contribution that we're focusing on is using handprinting --  
deterministic sampling of the fingerprints -- to efficiently locate  
_other peers_ who have similar files.
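
Here's a minimal sketch of the sampling idea, using one simple
deterministic rule (take the k numerically smallest fingerprints) -
the paper spells out the exact scheme we use:

    HANDPRINT_SIZE = 30   # k lookups instead of one per chunk

    def handprint(chunk_fingerprints, k=HANDPRINT_SIZE):
        # Every peer applies the same rule, so two files that share
        # chunks will publish overlapping samples with high
        # probability.
        return sorted(set(chunk_fingerprints))[:k]

Each peer publishes its handprint entries into the DHT and looks up
its own; any collision points at a peer holding a similar file, at a
cost of O(k) lookups rather than one per chunk.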

Did I miss anything?  Glad that you found the work interesting.  Note  
that we've released all of the source code for SET, but the UI is  
about what one would expect from an academic research project. :)  (A  
command line with cryptic syntax.)  We're happy to provide some  
guidance if anyone wants to implement it in one of the popular P2P  
systems.  The handprinting itself would be pretty easy to do in any  
system that already uses a DHT or other similar key->value lookup  
system; building on top of Rabin fingerprinting for the  
insertion/deletion robustness would take bigger changes to most  
existing systems.

   -Dave

