Serial streaming performance should be Math.min(ideal client performance, ideal 
serial hdfs performance)
--------------------------------------------------------------------------------------------------------

                 Key: HADOOP-3288
                 URL: https://issues.apache.org/jira/browse/HADOOP-3288
             Project: Hadoop Core
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.16.3, 0.18.0
         Environment: Mac OS X  10.5.2, Java 6
            Reporter: Sam Pullara
             Fix For: 0.18.0


I looked long and hard at all the code, and this is my analysis (it could be 
wrong; I'm not an expert on this codebase):

Current Serial HDFS Performance = Average Datanode Performance
Average Datanode Performance = Average Disk Performance (even if the datanode 
has more than one disk)

We should have:

Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
Ideal Datanode Performance = Sum of Disk Performance

When you read a single file serially from HDFS, two limitations come into 
play:

1) Blocks on multiple datanodes are load balanced between them, so serial read 
performance is the average of the datanodes' performance rather than the sum
2) Blocks on multiple disks in a single datanode are load balanced between 
them, so datanode performance is the average of the disks' performance rather 
than the sum
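
To make the formulas above concrete, here is a small sketch that computes the 
current and ideal serial rates for a made-up cluster, plus the target from the 
issue title. The class name, method names and the MB/s figures are all 
hypothetical; the "average" model is just my analysis above, not something 
measured.

```java
// Hypothetical model of the formulas above; not part of Hadoop.
final class SerialThroughputModel {

    // Sum of the given rates (MB/s).
    static double sum(double[] rates) {
        double total = 0.0;
        for (double r : rates) {
            total += r;
        }
        return total;
    }

    // Arithmetic mean of the given rates (MB/s); models load balancing.
    static double average(double[] rates) {
        return sum(rates) / rates.length;
    }

    // Current behaviour per the analysis: disks are averaged within each
    // datanode, then the datanodes are averaged.
    static double currentSerialRate(double[][] disksPerDatanode) {
        double[] perNode = new double[disksPerDatanode.length];
        for (int i = 0; i < perNode.length; i++) {
            perNode[i] = average(disksPerDatanode[i]);
        }
        return average(perNode);
    }

    // Ideal behaviour: disks are summed within each datanode, then the
    // datanodes are summed.
    static double idealSerialRate(double[][] disksPerDatanode) {
        double[] perNode = new double[disksPerDatanode.length];
        for (int i = 0; i < perNode.length; i++) {
            perNode[i] = sum(disksPerDatanode[i]);
        }
        return sum(perNode);
    }

    // The target from the issue title: the slower of the client and HDFS.
    static double targetRate(double idealClientRate, double[][] disksPerDatanode) {
        return Math.min(idealClientRate, idealSerialRate(disksPerDatanode));
    }

    public static void main(String[] args) {
        // Assumed figures: two datanodes, each with a 50 MB/s and a 70 MB/s disk.
        double[][] cluster = {{50.0, 70.0}, {50.0, 70.0}};
        System.out.println("current: " + currentSerialRate(cluster));
        System.out.println("ideal:   " + idealSerialRate(cluster));
        System.out.println("target:  " + targetRate(200.0, cluster));
    }
}
```

With these assumed figures the current rate works out to 60 MB/s, the ideal 
rate to 240 MB/s, and a client that can consume 200 MB/s would be the limit at 
200 MB/s.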

I think all of this could be fixed if the client prefetched whole blocks ahead 
of the reader, fetching further ahead until the client can no longer keep up 
with the data or another bottleneck, such as network bandwidth, is reached.
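
A rough sketch of the prefetching idea, assuming a fixed window of block 
fetches kept in flight while the reader consumes blocks in order. This is not 
the existing DFSClient code; `BlockPrefetcher`, `readSerially`, `fetchBlock` 
and `window` are hypothetical names, and a fixed window stands in for "until 
the client can no longer keep up".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Hypothetical sketch; not part of Hadoop.
final class BlockPrefetcher {

    // Delivers blocks 0..blockCount-1 to the consumer in order while keeping
    // up to `window` (>= 1) block fetches running in the background.
    static void readSerially(int blockCount, int window,
                             IntFunction<byte[]> fetchBlock,
                             Consumer<byte[]> consumer) {
        ExecutorService pool = Executors.newFixedThreadPool(window);
        try {
            Deque<Future<byte[]>> inFlight = new ArrayDeque<>();
            int next = 0;      // index of the next block to request
            int delivered = 0; // number of blocks handed to the consumer
            while (delivered < blockCount) {
                // Top up the window so different datanodes/disks can be
                // read at the same time.
                while (next < blockCount && inFlight.size() < window) {
                    final int index = next++;
                    Callable<byte[]> task = () -> fetchBlock.apply(index);
                    inFlight.add(pool.submit(task));
                }
                // Wait for the oldest request so blocks arrive in file order.
                consumer.accept(inFlight.remove().get());
                delivered++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while prefetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("block fetch failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because the consumer blocks the loop, a slow client naturally stops new 
fetches from being issued once the window is full, which is the back-pressure 
described above. Choosing the window size (or growing it adaptively) is left 
open here.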

This seems like a reasonably common use case, though it is not the typical 
MapReduce case.


