[
https://issues.apache.org/jira/browse/HADOOP-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590972#action_12590972
]
Doug Cutting commented on HADOOP-3288:
--------------------------------------
The HDFS client is currently optimized for the case where the number of readers
is greater than or equal to the number of datanodes and drives. In this case,
speculatively fetching the same block from multiple datanodes or speculative
read-ahead will probably reduce overall throughput.
If, however, the number of readers is substantially smaller than the number of
datanodes and drives, and access is known to be sequential, then speculatively
pre-fetching blocks could speed things up somewhat. This could perhaps be
accomplished by adding a read-ahead mode to the client. Is that what you have
in mind?
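
To make the read-ahead idea concrete, here is a minimal sketch of a stream wrapper that pre-fetches on a background thread. It is not HDFS client code: the names ReadAheadInputStream, chunkSize and maxChunksAhead are hypothetical, and it only reads ahead on a single underlying stream, whereas a real read-ahead mode would presumably pre-fetch upcoming blocks from several datanodes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical read-ahead wrapper; illustrates the idea only, not the HDFS client. */
final class ReadAheadInputStream extends InputStream {
    private static final byte[] EOF = new byte[0]; // sentinel: no more chunks will arrive
    private final InputStream source;
    private final BlockingQueue<byte[]> chunks;    // bounded, so read-ahead is limited
    private final Thread fetcher;
    private volatile IOException failure;          // error seen by the fetcher, if any
    private byte[] current = new byte[0];
    private int pos = 0;
    private boolean done = false;

    ReadAheadInputStream(InputStream source, int chunkSize, int maxChunksAhead) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be > 0");
        this.source = source;
        this.chunks = new ArrayBlockingQueue<>(maxChunksAhead);
        this.fetcher = new Thread(() -> {
            try {
                byte[] buf = new byte[chunkSize];
                int n;
                // put() blocks once maxChunksAhead chunks are waiting, which
                // stops the fetcher from running arbitrarily far ahead.
                while ((n = source.read(buf)) >= 0) {
                    if (n > 0) chunks.put(Arrays.copyOf(buf, n));
                }
            } catch (IOException e) {
                failure = e;
            } catch (InterruptedException e) {
                return; // the consumer closed the stream; stop quietly
            }
            try {
                chunks.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
    }

    @Override
    public int read() throws IOException {
        while (pos == current.length) {
            if (done) return -1;
            try {
                current = chunks.take(); // waits only if the fetcher has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while waiting for read-ahead", e);
            }
            pos = 0;
            if (current == EOF) {
                done = true;
                if (failure != null) throw failure;
                return -1;
            }
        }
        return current[pos++] & 0xff;
    }

    @Override
    public void close() throws IOException {
        fetcher.interrupt();
        source.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        try (ReadAheadInputStream in =
                 new ReadAheadInputStream(new ByteArrayInputStream(data), 1024, 4)) {
            System.out.println(Arrays.equals(data, in.readAllBytes())); // true
        }
    }
}
```

The bounded queue is the design choice that matters here: it caps how much data is fetched speculatively, so a slow consumer does not cause unbounded buffering on the client.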
> Serial streaming performance should be Math.min(ideal client performance,
> ideal serial hdfs performance)
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3288
> URL: https://issues.apache.org/jira/browse/HADOOP-3288
> Project: Hadoop Core
> Issue Type: Improvement
> Components: dfs
> Affects Versions: 0.16.3, 0.18.0
> Environment: Mac OS X 10.5.2, Java 6
> Reporter: Sam Pullara
> Fix For: 0.18.0
>
>
> I read through the code carefully, and this is my analysis (it could be
> wrong; I'm not an expert on this codebase):
> Current Serial HDFS Performance = Average Datanode Performance
> Average Datanode Performance = Average Disk Performance (even if a datanode
> has more than one disk)
> We should have:
> Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
> Ideal Datanode Performance = Sum of disk performance
> When you read a single file serially from HDFS there are a number of
> limitations that come into play:
> 1) Blocks on multiple datanodes will be load balanced between them -
> averaging the performance of the datanodes
> 2) Blocks on multiple disks in a single datanode are load balanced between
> them - averaging the performance of the disks
> I think that all this could be fixed if the client prefetched whole blocks,
> reading ahead until it can no longer keep up with the data or until it hits
> another bottleneck such as network bandwidth.
> This seems like a reasonably common use case, though not the typical MapReduce
> case.
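
The relations in the description can be worked through with a small model. This is a sketch only: the class SerialReadModel, its method names and the MB/s figures are made up for illustration, and "average" is taken to be a plain arithmetic mean, which is one simple reading of the analysis above.

```java
import java.util.List;

/** Hypothetical model of the analysis in the issue description; not Hadoop code. */
final class SerialReadModel {
    /** Arithmetic mean of a list of rates (MB/s). */
    static double averageRate(List<Double> rates) {
        return rates.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Sum of a list of rates (MB/s). */
    static double sumRate(List<Double> rates) {
        return rates.stream().mapToDouble(Double::doubleValue).sum();
    }

    /** Current behaviour per the analysis: average over datanodes of the average disk rate. */
    static double currentSerialRate(List<List<Double>> datanodes) {
        return datanodes.stream().mapToDouble(SerialReadModel::averageRate).average().orElse(0.0);
    }

    /** Ideal behaviour per the analysis: sum over datanodes of the summed disk rates. */
    static double idealSerialRate(List<List<Double>> datanodes) {
        return datanodes.stream().mapToDouble(SerialReadModel::sumRate).sum();
    }

    /** The issue title: Math.min(ideal client performance, ideal serial HDFS performance). */
    static double idealStreamingRate(double clientRate, List<List<Double>> datanodes) {
        return Math.min(clientRate, idealSerialRate(datanodes));
    }

    public static void main(String[] args) {
        // Made-up cluster: three datanodes, two disks each, 50 MB/s per disk.
        List<List<Double>> cluster = List.of(
            List.of(50.0, 50.0), List.of(50.0, 50.0), List.of(50.0, 50.0));
        System.out.println(currentSerialRate(cluster));         // 50.0
        System.out.println(idealSerialRate(cluster));           // 300.0
        System.out.println(idealStreamingRate(120.0, cluster)); // 120.0
    }
}
```

With these made-up numbers, a client able to consume 120 MB/s is held to the speed of a single disk today, while under the ideal model the client itself becomes the bottleneck.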
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.