>  - The 1 replica / 1 slave case writes at 15MB/sec.  This seems to point
> to how the datanode writes data (even to itself) as the source of the
> performance problem.

On Hadoop, most of the delay you are seeing in the 1 replica test with one node is because of this: the client first writes 64MB to a local tmp file, then sends that 64MB file over (local) ethernet to the DataNode on the same node before starting to write the next 64MB. Writing to the tmp file and sending to the DataNode are *not* pipelined.
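(A rough back-of-the-envelope check, with made-up numbers: if the local tmp write runs at, say, 50MBps and the loopback transfer to the DataNode at 30MBps, the serialized path costs 1/50 + 1/30 seconds per MB, i.e. about 19MBps end to end, which is in the same ballpark as the 15MBps you measured. A pipelined path would only be limited by the slower of the two stages, i.e. ~30MBps in this example.)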

Disk b/w is not always equal to the raw serial read/write bandwidth you get on a fresh partition with a large disk. (In fact, 75MBps sounds pretty high. What kind of disk is it? Is it a RAID, or a 10K rpm disk?)

I would suggest a simple exercise: write a 20GB file with dd, as you initially did when you measured 75MBps. Now read this file back and write another 20GB at the same time. Do you see 38MBps for each of the read and the write? You mostly won't. Where did the missing bandwidth go? You could repeat this on a partition that is 80% full. There are more factors affecting disk performance than raw serial read/write b/w, the most important of them being disk seeks.
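For example, something along these lines (paths and sizes are placeholders; adjust them to match your original dd test):

  # write a 20GB file, same kind of run that gave you 75MBps
  dd if=/dev/zero of=/data1/f1 bs=1M count=20480
  # now read it back while writing a second 20GB file at the same time
  dd if=/data1/f1 of=/dev/null bs=1M &
  dd if=/dev/zero of=/data1/f2 bs=1M count=20480 &
  wait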

This is not Hadoop related, and Hadoop's inefficiencies are not necessarily due to the same causes.

Also, the 30MBps you measured for your network is most likely limited by ssh processing in scp rather than by the b/w of the network. How can you confirm it?
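(One quick way, just a sketch -- cipher availability depends on your ssh build: watch whether ssh/sshd is pegging a CPU during the copy, or rerun the copy with a cheaper cipher and see if the throughput jumps.)

  # while the scp is running, see if ssh/sshd is using ~100% of a CPU
  top
  # rerun the same copy with a lighter cipher; a big jump in throughput
  # means scp was CPU-bound on encryption, not limited by the network
  scp -c arcfour bigfile othernode:/tmp/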

Raghu.

Bwolen Yang wrote:
Raghu,

The 1 replica and "du" suggestions are good.  thank you.

To further reduce the variables, I also tried the 1 replica / 1 slave case.
(The namenode and jobtracker are still on their own machines.)

- randomwriter:
 - The 1 replica / 1 slave case writes at 15MB/sec.  This seems to point
to how the datanode writes data (even to itself) as the source of the
performance problem.

 - The 1 replica / 5 slave case's running time is 1/4th of the 3 replica
case's.  Perfect scaling would have been 1/3rd.  So there is a 33%
additional performance overhead lost to replication (beyond writing 3x
as much data).


>>  - Looks like every 5GB of data I put into Hadoop DFS, it uses up ~18GB....

Turned out there are a few blocks that are only a few KB.  "du" is the
right tool.  The actual raw disk overhead is only 1%.  Thanks.
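Something along these lines shows it (the path is just a placeholder
for whatever dfs.data.dir points to):

  # total space the datanode actually uses on disk
  du -sh /path/to/dfs/data
  # individual block file sizes, smallest first -- a handful are only a few KB
  find /path/to/dfs/data -name 'blk_*' | xargs du -k | sort -n | head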


> You are assuming each block is 64M. There are some blocks for "CRC
> files". Did you try to du the datanode's 'data directories'?

All blk_* files are 64MB or less.

However, some mappers still show that they are accessing
               part-0:1006632960+70663780
where 70663780 is about 67MB.   Hmm... looks like it only does so
for the last block.  I guess that's not too bad.
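(The numbers are consistent with that: 1006632960 = 15 x 67108864, i.e.
exactly 15 blocks of 64MB, and 70663780 bytes is one 64MB block plus
roughly 3.4MB, so it looks like the short tail of the file was simply
folded into the final split rather than becoming a tiny split of its own.)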


> They are pipelined.

You're right :).   The slowness exists even in the single slave / single
replica case.

thanks

bwolen
