Steve,

It seems HP has done block-based parallel reading from different datanodes. Though not at the disk level, they achieve a 4 Gb/s aggregate rate with 9 readers (500 Mb/s each). I didn't see anywhere to download their code to play around with, a pity~
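For anyone who wants to experiment with the caller-side striping idea, the block-to-host mapping itself is visible from the public API without patching DfsClient; what you can't do from there is choose which replica a read actually goes to. A minimal sketch against the 0.20-era API (untested, class name is just for illustration):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path(args[0]);
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block: byte offset, length, and the
        // datanodes holding replicas of that block.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " len=" + block.getLength()
                + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

Note that the hosts come back at machine granularity only, which leads to my question below.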
BTW, can we specify which disk to read from with Java?

On Wed, Jul 7, 2010 at 1:30 AM, Steve Loughran <[email protected]> wrote:
> Michael Segel wrote:
>
>> Uhm...
>>
>> That's not really true. It gets a bit more complicated than that.
>>
>> If you're talking about M/R jobs, you don't want to do threads in your
>> map() routine; while it's possible, it's going to be really hard to
>> justify the extra parallelism along with the need to wait for all of the
>> threads to complete before you can end the map() method.
>> If you're talking about a way to copy files from one cluster to another
>> in Hadoop, you can find out the block list that makes up the file. As long
>> as the file is static, meaning no one is writing/splitting/compacting the
>> file, you could copy it. Here being multithreaded could work. You'd have
>> one thread per block that reads from one machine and then writes
>> directly to the other. Of course you'll need to figure out where to write
>> the block, or rather tie into HDFS.
>
> There's a paper by Russ Perry on using HDFS as a filestore for raster
> processing, where he modified DfsClient to get all the locations of a file
> and let the caller decide where to read blocks from.
>
> http://www.hpl.hp.com/techreports/2009/HPL-2009-345.html
>
> The advantage of this is that the caller can do the striping across
> machines, keeping every server busy by asking for files from each of them.
> Of course, this ignores the trend toward many-HDD servers; DfsClient can't
> currently see which physical disk a file is on, which you'd need if the
> client wanted to keep every disk on every server busy during a big read.
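PS: to Michael's one-thread-per-block point, the reading half might look roughly like this (again an untested sketch; each thread opens its own stream, since an FSDataInputStream's seek position can't be shared safely across threads):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(conf);
        final Path file = new Path(args[0]);

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        List<Thread> threads = new ArrayList<Thread>();
        for (BlockLocation block : blocks) {
            final long offset = block.getOffset();
            final long length = block.getLength();
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        // One stream per thread; seek to this block's
                        // offset and read exactly its length.
                        FSDataInputStream in = fs.open(file);
                        try {
                            in.seek(offset);
                            byte[] buf = new byte[64 * 1024];
                            long remaining = length;
                            while (remaining > 0) {
                                int n = in.read(buf, 0,
                                    (int) Math.min(buf.length, remaining));
                                if (n < 0) break;
                                remaining -= n;
                                // ...hand the bytes to whatever consumes them
                            }
                        } finally {
                            in.close();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}

Of course this still doesn't answer my disk question: each read goes to whichever replica the DfsClient picks, which is exactly the part Perry's modified client changes.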
