Serial streaming performance should be Math.min(ideal client performance, ideal
serial hdfs performance)
--------------------------------------------------------------------------------------------------------
Key: HADOOP-3288
URL: https://issues.apache.org/jira/browse/HADOOP-3288
Project: Hadoop Core
Issue Type: Improvement
Components: dfs
Affects Versions: 0.16.3, 0.18.0
Environment: Mac OS X 10.5.2, Java 6
Reporter: Sam Pullara
Fix For: 0.18.0
I looked at all the code long and hard and this was my analysis (could be
wrong, I'm not an expert on this codebase):
Current Serial HDFS Performance = Average Datanode Performance
Average Datanode Performance = Average Disk Performance (even if a datanode has
more than one disk)
We should have:
Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
Ideal Datanode Performance = Sum of disk performance
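
To make the difference between the two sets of formulas concrete, here is a
minimal sketch of the arithmetic. It is only a model of the formulas above, not
HDFS code: the class and method names are made up for illustration, and the
throughput figures (MB/s) are invented, not measurements.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the formulas above; nothing here is part of Hadoop.
final class SerialReadModel {
    static double sum(List<Double> values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    static double average(List<Double> values) {
        return values.isEmpty() ? 0.0 : sum(values) / values.size();
    }

    // Current: average over datanodes of the average disk performance.
    static double currentSerialHdfs(List<List<Double>> datanodes) {
        double total = 0.0;
        for (List<Double> disks : datanodes) {
            total += average(disks);
        }
        return datanodes.isEmpty() ? 0.0 : total / datanodes.size();
    }

    // Ideal: sum over datanodes of the sum of disk performance.
    static double idealSerialHdfs(List<List<Double>> datanodes) {
        double total = 0.0;
        for (List<Double> disks : datanodes) {
            total += sum(disks);
        }
        return total;
    }

    // The target from the issue title.
    static double serialStreaming(double idealClient, double idealSerialHdfs) {
        return Math.min(idealClient, idealSerialHdfs);
    }

    public static void main(String[] args) {
        // Assumed cluster: 2 datanodes, each with 2 disks at 50 MB/s.
        List<List<Double>> cluster = Arrays.asList(
                Arrays.asList(50.0, 50.0),
                Arrays.asList(50.0, 50.0));
        double ideal = idealSerialHdfs(cluster);
        System.out.println("current = " + currentSerialHdfs(cluster));
        System.out.println("ideal   = " + ideal);
        // Assumed client that can consume 120 MB/s.
        System.out.println("target  = " + serialStreaming(120.0, ideal));
    }
}
```

With these assumed numbers the current model gives 50 MB/s, the ideal gives 200
MB/s, and a client that can consume 120 MB/s would be the limiting factor.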
When you read a single file serially from HDFS, two limitations come into
play:
1) Blocks are load balanced across multiple datanodes, so a serial read sees
the average performance of the datanodes rather than their sum
2) Blocks are load balanced across multiple disks within a single datanode, so
a serial read sees the average performance of the disks rather than their sum
I think all of this could be fixed if the client prefetched whole blocks ahead
of the reader, until either the client can no longer keep up with the data or
another bottleneck, such as network bandwidth, is reached.
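
A rough sketch of what such client-side prefetching could look like, using
only java.util.concurrent. This is not the DFSClient implementation or a
proposed patch: BlockPrefetcher, fetchBlock and window are hypothetical names,
and fetchBlock stands in for whatever actually reads one whole block from a
datanode. The bounded window is what stops prefetching once the client cannot
keep up, since a new fetch is only started after the oldest block is consumed.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical sketch; none of these names exist in Hadoop.
final class BlockPrefetcher {
    /**
     * Returns blocks 0..blockCount-1 concatenated in order, keeping at most
     * `window` (>= 1) block fetches in flight at any time.
     */
    static byte[] readAll(IntFunction<byte[]> fetchBlock, int blockCount, int window) {
        ExecutorService pool = Executors.newFixedThreadPool(window);
        Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            int next = 0;
            while (next < blockCount || !inFlight.isEmpty()) {
                // Top up the window with fetches for the next blocks.
                while (next < blockCount && inFlight.size() < window) {
                    final int index = next++;
                    Callable<byte[]> task = () -> fetchBlock.apply(index);
                    inFlight.add(pool.submit(task));
                }
                // Consume strictly in file order; later blocks keep loading
                // in the background while we wait for the oldest one.
                byte[] block = inFlight.remove().get();
                out.write(block, 0, block.length);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("block fetch failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Fake block source: block i is two bytes, both equal to i.
        byte[] data = readAll(i -> new byte[] {(byte) i, (byte) i}, 4, 2);
        System.out.println("read " + data.length + " bytes");
    }
}
```

If consecutive blocks live on different datanodes or disks, the fetches in the
window can proceed in parallel, which is where the "sum" behaviour would come
from. A real implementation would stream to the caller instead of buffering
the whole file, and would need to bound memory by block size times window.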
This seems like a reasonably common use case, though not the typical MapReduce
case.