Raghu,

The 1 replica and "du" suggestions are good.  thank you.

To further reduce the variables, I also tried the 1-replica / 1-slave case.
(The namenode and jobtracker are still on their own machines.)

- randomwriter:
 - The 1-replica / 1-slave case writes at 15MB/sec.  This seems to point
the performance problem at how the datanode writes data (even to itself).

 - The 1-replica / 5-slave case's running time is 1/4th of the 3-replica
case's.  Perfect scaling would have been 1/3rd.  So there is a 33%
additional performance overhead lost to replication, beyond writing 3x
as much data (see the arithmetic sketch below).
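
To spell out that 33% figure (a back-of-envelope sketch in Python; the
times are normalized placeholders, not the actual measurements):

    # T1 = running time of the 1-replica / 5-slave run (normalized to 1)
    T1 = 1.0
    T3_observed = 4 * T1   # the 3-replica run took ~4x as long
    T3_ideal    = 3 * T1   # writing 3x the data "should" only take ~3x as long
    extra = (T3_observed - T3_ideal) / T3_ideal
    print("overhead beyond the 3x data volume: %.0f%%" % (extra * 100))   # -> 33%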


> - Looks like every 5GB of data I put into Hadoop DFS, it uses up ~18GB....

Turned out there are a few blocks that are only a few KB.  "du" is the
right tool.  The actual raw disk overhead is only 1%.  thanks.
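
(For anyone repeating this check, something along these lines does the
same du-style accounting; the data directory path is just an example
and depends on how dfs.data.dir is configured on your machines.)

    import os

    DATA_DIR = "/tmp/hadoop/dfs/data"   # example path; point this at the datanode's data directory
    BLOCK = 64 * 1024 * 1024            # default dfs block size

    total = 0
    for root, dirs, files in os.walk(DATA_DIR):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            total += size
            if name.startswith("blk_") and size < 1024 * 1024:
                print("small block:", name, size)   # e.g. blocks backing the tiny CRC files
    print("total bytes on disk: %d (~%.1f GB)" % (total, total / 1e9))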


> You are assuming each block is 64M. There are some blocks for "CRC
> files". Did you try to du the datanode's 'data directories'?

All blk_* files are 64MB or less.

However, some mappers still show that they are accessing
               part-0:1006632960+70663780
where 70663780 is about 67MB.   Hmm... it looks like this only happens
on the last block.  I guess that's not too bad.
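
Those numbers are at least self-consistent (a quick sanity check,
assuming the default 64MB block size):

    block = 64 * 1024 * 1024                 # default dfs block size
    offset, length = 1006632960, 70663780    # from the split string above
    print(offset / float(block))    # 15.0  -> the split starts exactly at block 15
    print(length / float(block))    # ~1.05 -> one full block plus a ~3.4MB tail

So the final split is presumably just the last block with the file's
leftover bytes folded in, rather than the framework reading past 64MB
in the general case.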


> They are pipelined.

You're right :).  The slowness exists even in the single-slave /
single-replica case.
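
(As a sanity check on that 15MB/sec figure, something like the following
gives a raw sequential-write baseline for the same disk, to see how much
of the gap is the disk itself vs. the datanode write path; the path and
sizes are just placeholders.)

    import os, time

    PATH = "/tmp/write_baseline.dat"   # placeholder: put this on the disk the datanode uses
    CHUNK = 1024 * 1024                # 1MB writes
    TOTAL_MB = 1024                    # 1GB total

    buf = os.urandom(CHUNK)
    start = time.time()
    f = open(PATH, "wb")
    for _ in range(TOTAL_MB):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())               # make sure the data really reached the disk
    f.close()
    elapsed = time.time() - start
    print("raw sequential write: %.1f MB/sec" % (TOTAL_MB / elapsed))
    os.remove(PATH)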

thanks

bwolen
