> - 1 replica / 1 slave case writes at 15MB/sec. This seems to point
> the performance problem to how the datanode writes data (even to itself).
On Hadoop, most of the delay you are seeing for the 1 replica test with
one node is because of this: it first writes 64MB to a local tmp file,
then it sends that 64MB file over the (local) ethernet to the DataNode
on the same node before starting to write the next 64MB. Writing to the
tmp file and sending to the DataNode are *not* pipelined.
Disk b/w is not always equal to the raw serial read/write bandwidth you
get on a fresh partition with a large disk. (In fact, 75MBps sounds
pretty high. What kind of disk is it? Is it a RAID, or a 10K rpm disk?)
I would suggest a simple exercise: write a 20GB file with dd, as you
initially did when you measured 75MBps. Now read this file and write
another 20GB at the same time. Do you see 38MBps for each of the read
and the write? You mostly won't. Where did the missing bandwidth go?
You could repeat this on a partition that is 80% full. There are more
factors that affect disk performance than raw serial read/write b/w,
the most important of them being disk seeks.
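For example, one way to run that test (the file names and mount point
below are just placeholders; make the files much larger than RAM so the
read really hits the disk and not the page cache):

  # serial write -- this is the kind of test that showed ~75MBps
  dd if=/dev/zero of=/mnt/data/ddtest.1 bs=1M count=20480

  # now read the first file back while writing a second 20GB file,
  # and compare the rates dd reports against 75/2 = ~38MBps each
  dd if=/mnt/data/ddtest.1 of=/dev/null bs=1M &
  dd if=/dev/zero of=/mnt/data/ddtest.2 bs=1M count=20480 &
  wait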
This is not Hadoop related, and Hadoop's inefficiencies are not
necessarily due to the same cause.
Also, the 30MBps you measured for your network is most likely limited
by ssh processing in scp rather than by the b/w of the network itself.
How can you confirm it?
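One rough way to check (the hostname and port below are placeholders,
and this assumes netcat is installed on both machines; some netcat
versions want '-l -p 12345' instead of '-l 12345') is to push the same
amount of data over a raw TCP connection and compare with the scp rate:

  # on the receiving node
  nc -l 12345 > /dev/null

  # on the sending node: dd reports the transfer rate when it finishes
  dd if=/dev/zero bs=1M count=2048 | nc receiver-host 12345

If the raw TCP rate comes out well above 30MBps, then ssh/scp is the
bottleneck, not the network.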
Raghu.
Bwolen Yang wrote:
Raghu,
The 1 replica and "du" suggestions are good. thank you.
To further reduce the variables, I also tried the 1 replica / 1 slave
case. (namenode and jobtracker are still on their own machines.)
- randomwriter:
- 1 replica / 1 slave case writes at 15MB/sec. This seems to point
the performance problem to how the datanode writes data (even to itself).
- The 1 replica / 5 slave case's running time is 1/4th of the 3 replica
case's. Perfect scaling would have been 1/3rd (the 3 replica run took
about 4x the 1 replica time where 3x would be ideal). So, there is a 33%
additional performance overhead lost to replication (beyond writing 3x
as much data).
- Looks like for every 5GB of data I put into Hadoop DFS, it was using
up ~18GB.... Turned out there are a few blocks that are only a few KB,
so estimating usage from the block count overstates it. "du" is the
right tool. The actual raw disk overhead is only 1%. thanks.
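In case anyone wants to repeat the check, it was roughly this (the path
is just an example; use whatever dfs.data.dir points to on the slave):

  # actual bytes on disk under the datanode's data directory
  du -sh /path/to/dfs/data

  # sizes of the individual block files (this is what exposes the
  # few-KB blocks that throw off a "blocks x 64MB" estimate)
  find /path/to/dfs/data -name 'blk_*' -exec ls -l {} \; | sort -n -k5

and then compare the du total against the amount of data written into DFS.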
> You are assuming each block is 64MB. There are some blocks for "CRC
> files". Did you try to du the datanode's 'data directories'?
All blk_* files are 64MB or less.
However, some mappers still show it is accessing
part-0:1006632960+70663780
where 70663780 is about 67MB. Hmm... looks like it is only doing so
at the last block. I guess that's not too bad.
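A quick sanity check on those numbers with shell arithmetic (the offset
sits exactly on a 64MB block boundary, and the length is a bit over one
block):

  echo $(( 1006632960 % (64 * 1024 * 1024) ))   # 0  -> offset is an exact multiple of 64MB
  echo $(( 1006632960 / (64 * 1024 * 1024) ))   # 15 -> i.e., 15 full blocks in
  echo $(( 70663780 / (1024 * 1024) ))          # 67 -> split length is ~67MB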
> They are pipelined.
you're right :). the slowness exists even in the single slave / single
replica case.
thanks
bwolen