On Thu, Apr 23, 2009 at 1:50 PM, Rufus Pollock <[email protected]> wrote: > 2009/4/23 Luis Villa <[email protected]>: >> On Thu, Apr 23, 2009 at 10:29 AM, <[email protected]> wrote: >>> Quoting Lukasz Szybalski <[email protected]>: >>>> I guess the question would be: Could you describe the >>> type of data you >>>> currently have. (percentage of space, downloads, changes) >>> >>> This is the directory that has broken the system (watch >>> out-- it may break your browser): >>> >>> http://ukparse.kforge.net/svn/undata/pdf/ >>> >>> It's several thousand large PDFs of UN documents. The >>> same would apply to scanned images, archived pages from >>> Hansard, etc. > > [snip] > >> The traditional way to handle data sets like this is a combination of >> http/ftp mirroring; you might ask the fedora people if their >> mirrormanager code is available. That is very complicated, though- it >> relies on either active user participation ('select the mirror closest >> to you') or a variety of other tricks ('we'll try to guess the mirror >> closest to you') to select mirrors, and requires a combination of >> software and human screening to monitor whether or not the mirror is >> actually active, uncorrupted, etc. That said it has worked well for 15 >> years for Linux distros. > > This seems to be more oriented to solving the download bandwidth > problem. While this might become an issue at some point I think the > first problem is a storage one. > >> I think the more forward-thinking way to go is to use bittorrent + >> some sort of simple script to encourage mirrors to add new files as >> they are created (i.e., cron + rss + command-line torrent client.) BT >> is wildly inefficient for files of this size but is the only >> widely-available/widely-understood p2p tool, handles automagically all >> the hard parts of ftp/http mirroring (except regularly adding new >> files) and is, I think, more ideologically appropriate for anyone >> interested in creating a real knowledge commons than a centralized >> tool like ftp/http. > > Like you I thought about BT a lot when this problem first came up. The > problem with BT is that it is oriented to solving the b/w problem and > it isn't a distributed file store. In particular: > > * No way to do chunking (BT will chunk automatically the file when > doing its download/upload but no way to just get a node to keep only a > part of a file) > > * No way to allocate chunks to nodes (each client decides what files > it is going to hold) > > * (Associatedly) Replication of chunks is not built in > > * Poor node persistence (BT is oriented to systems where users enter > and exit rapidly rather than one where nodes are persistent). > > That said it might well be possible to build some of this > infrastructure on top of BT but as it stands it would seem to be quite > a task. Does anyone know of anyone who has built a distributed storage > system on top of BT?
Ah, I see how I misunderstood the question. I'm afraid I don't have any constructive suggestions, sorry. >> (I long for the day when my home network regularly serves up several >> gigs of purely legal torrented files every day, reducing the load on >> community projects I care about. And I wouldn't mind being the first >> one to have my cable company try to shut me off for it. That'd be all >> kinds of fun. :) > > Great. That means if we get something together we've already got one > volunteer node :) When I'm back from bar exam/wedding/honeymoon (i.e., probably October/November), absolutely. ;) Luis _______________________________________________ okfn-discuss mailing list [email protected] http://lists.okfn.org/cgi-bin/mailman/listinfo/okfn-discuss
