elton sky wrote:
> Steve,
> Seems HP has done block-based parallel reading from different datanodes.
yes; very much like IBM's GPFS, only with JBOD storage and the option of
running code near the data when appropriate.
> Though not at the disk level, they achieve a 4Gb/s aggregate rate with
> 9 readers (500Mb/s each).
> I didn't see anywhere I could download their code to play around with,
> pity~
I do have access to that code if I can get at the right bit of the
repository; if you really want me to look at it in detail, ask, with the
caveat that I'm away for the rest of the month and somewhat busy. Apart
from that there's no reason why I shouldn't be able to make the changes
to DfsClient public. Keep reminding me :)
> BTW, can we specify which disk to read from with Java?
I think right now you get a list of blocks via
DfsClient.getBlockLocations(); this is a list of the hosts where each
block lives. There is no data about which disk on the specific host
holds it.
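As a minimal sketch of that lookup -going through the public
FileSystem/BlockLocation API rather than DfsClient itself; the class
name and the file path argument are just for illustration- it looks
something like this:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockHosts {
  public static void main(String[] args) throws Exception {
    // The file whose block layout we want, e.g. /jobs/pages.pdf
    Path file = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block: offset, length and the hosts holding
    // a replica. Nothing here says which disk on the host it lives on.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " len=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}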
I believe that what Russ did was move the decision out of DfsInputStream
-which picks a block location for you, with a bias towards the local
host- and instead let the calling program decide where to fetch each
block from. That meant he could set the renderer up to request blocks
from different hosts.
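A rough sketch of that kind of client-side choice, assuming you already
have the BlockLocation[] from the snippet above; actually directing the
read at the chosen host still needs the DfsClient changes discussed
here, as the stock client won't let you pin a datanode:

import org.apache.hadoop.fs.BlockLocation;

public class BlockHostChooser {
  /**
   * Pick a host for each block, round-robining over the replicas so
   * that consecutive blocks are fetched from different datanodes.
   */
  public static String[] chooseHosts(BlockLocation[] blocks)
      throws Exception {
    String[] chosen = new String[blocks.length];
    for (int i = 0; i < blocks.length; i++) {
      String[] replicas = blocks[i].getHosts();
      chosen[i] = replicas[i % replicas.length];   // naive spread
    }
    return chosen;
  }
}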
He had tried to use the JT to schedule the rendering code, but that
didn't work as MapReduce has the notion of "reduction": less data out
than in, so it moves work to where the data is. In rendering it's more
MapExpand; the operation is the transformation of PDF pages into 600dpi
32bpp bitmaps, which then need to be streamed to the (very large)
printer at its print rate, in the correct order. It was easiest to have
a specific machine on the cluster -with no datanodes or TTs- set up to
do the rendering, and just ask the filesystem where things are.
Like I said, I don't think there was anything tricky done in DfsClient;
it was more a matter of making public some data that the DfsClient code
already knows internally, so that the client app can decide where to
fetch data from. If the DfsClient knew which HDD the data was on within
a datanode, the client app could use that in its decision making too,
so that if the 9 machines each had 6 HDDs, you could keep them all busy.
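Purely hypothetical, since HDFS doesn't expose per-disk placement today,
but if block locations ever carried a disk ID alongside the host names,
the client-side chooser could balance over (host, disk) pairs instead of
just hosts; a least-loaded pick along these lines would keep all 54
spindles busy:

import java.util.HashMap;
import java.util.Map;

public class HostDiskChooser {
  /**
   * For each block, candidates[i] lists the (host, diskId) pairs
   * holding a replica -exactly the information HDFS does not give you
   * today- e.g. "node3:disk5". Pick the pair with the fewest blocks
   * already assigned so every spindle on every datanode stays busy.
   */
  public static String[] choose(String[][] candidates) {
    Map<String, Integer> load = new HashMap<String, Integer>();
    String[] chosen = new String[candidates.length];
    for (int i = 0; i < candidates.length; i++) {
      String best = null;
      int bestLoad = Integer.MAX_VALUE;
      for (String hostDisk : candidates[i]) {
        int assigned = load.containsKey(hostDisk) ? load.get(hostDisk) : 0;
        if (assigned < bestLoad) {
          best = hostDisk;
          bestLoad = assigned;
        }
      }
      chosen[i] = best;
      load.put(best, bestLoad + 1);
    }
    return chosen;
  }
}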