Steve,

It seems HP has done block-based parallel reading from different datanodes. Though not at the disk level, they achieve a 4 Gb/s aggregate rate with 9 readers (500 Mb/s each). I didn't see anywhere to download their code to play around with, a pity~
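For anyone who wants to experiment with the caller-side striping idea, the block-to-host mapping itself is visible from the public API without patching DfsClient; what you can't do from there is choose which replica a read actually goes to. A minimal sketch against the 0.20-era API (untested, class name is just for illustration):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: byte offset, length, and the
        // datanodes holding replicas of that block.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " len=" + block.getLength()
                + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

Note that the hosts come back at machine granularity only, which leads to my question below.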
BTW, can we specify which disk to read from with Java?

On Wed, Jul 7, 2010 at 1:30 AM, Steve Loughran <[email protected]> wrote:
> Michael Segel wrote:
>
>> Uhm...
>>
>> That's not really true. It gets a bit more complicated than that.
>>
>> If you're talking about M/R jobs, you don't want to do threads in your
>> map() routine; while it's possible, it's going to be really hard to
>> justify the extra parallelism along with the need to wait for all of the
>> threads to complete before you can end the map() method.
>> If you're talking about a way to copy files from one cluster to another
>> in Hadoop, you can find out the block list that makes up the file. As long
>> as the file is static, meaning no one is writing/splitting/compacting the
>> file, you could copy it. Here being multithreaded could work. You'd have
>> one thread per block that reads from one machine and then writes
>> directly to the other. Of course you'll need to figure out where to write
>> the block, or rather tie into HDFS.
>
> There's a paper by Russ Perry on using HDFS as a filestore for raster
> processing, where he modified DfsClient to get all the locations of a file
> and let the caller decide where to read blocks from.
>
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
>
> The advantage of this is that the caller can do the striping across
> machines, keeping every server busy by asking for files from each of them.
> Of course, this ignores the trend toward many-HDD servers; DfsClient can't
> currently see which physical disk a file is on, which you'd need if the
> client wanted to keep every disk on every server busy during a big read.
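PS: to Michael's one-thread-per-block point, the reading half might look roughly like this (again an untested sketch; each thread opens its own stream, since an FSDataInputStream's seek position can't be shared safely across threads):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final Path file = new Path(args[0]);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        List<Thread> threads = new ArrayList<Thread>();
        for (BlockLocation block : blocks) {
            final long offset = block.getOffset();
            final long length = block.getLength();
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        // One stream per thread; seek to this block's
                        // offset and read exactly its length.
                        FSDataInputStream in = fs.open(file);
                        try {
                            in.seek(offset);
                            byte[] buf = new byte[64 * 1024];
                            long remaining = length;
                            while (remaining > 0) {
                                int n = in.read(buf, 0,
                                    (int) Math.min(buf.length, remaining));
                                if (n < 0) break;
                                remaining -= n;
                                // ...hand the bytes to whatever consumes them
                            }
                        } finally {
                            in.close();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}

Of course this still doesn't answer my disk question: each read goes to whichever replica the DfsClient picks, which is exactly the part Perry's modified client changes.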
