[jira] Commented: (HADOOP-3288) Serial streaming performance should be Math.min(ideal client performance, ideal serial hdfs performance)

Raghu Angadi (JIRA) Wed, 23 Apr 2008 09:51:00 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12591679#action_12591679
 ]


Raghu Angadi commented on HADOOP-3288:
--------------------------------------

> The first negative (lower cluster throughput) definitely would, but that's 
> the point: instead of always paying that penalty, as you would with RAID 0, 
> an optional read-ahead feature would let clients declare when latency should 
> be prioritized ahead of throughput.

For that, the decision has to be made at write time, since that's when 
striping is done. Once a block is striped, it looks to me like the read 
speed/slowdown would be on par with RAID 0, irrespective of read-ahead.

> Serial streaming performance should be Math.min(ideal client performance, 
> ideal serial hdfs performance)
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3288
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3288
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.16.3, 0.18.0
>         Environment: Mac OS X  10.5.2, Java 6
>            Reporter: Sam Pullara
>             Fix For: 0.18.0
>
>
> I looked at all the code long and hard and this was my analysis (could be 
> wrong, I'm not an expert on this codebase):
> Current Serial HDFS performance = Average Datanode Performance
> Average Datanode Performance = Average Disk Performance (even if the 
> datanode has more than one disk)
> We should have:
> Ideal Serial HDFS Performance = Sum of Ideal Datanode Performance
> Ideal Datanode Performance = Sum of disk performance
> When you read a single file serially from HDFS there are a number of 
> limitations that come into play:
> 1) Blocks on multiple datanodes will be load balanced between them - 
> averaging the performance of the datanodes
> 2) Blocks on multiple disks in a single datanode are load balanced between 
> them - averaging the performance of the disks
> I think that all this could be fixed if we actually prefetched fully read 
> blocks on the client until the client can no longer keep up with the data or 
> there is another bottleneck like network bandwidth.
> This seems like a reasonably common use case though not the typical MapReduce 
> case.
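
To make the arithmetic in the quoted description concrete, here is a minimal 
sketch of the throughput model it states. Everything named here 
(SerialReadModel, the method names, the 60 MB/s disks and the 200 MB/s 
client) is hypothetical and chosen only for illustration; it models the 
formulas above and is not HDFS code.

```java
import java.util.List;

public final class SerialReadModel {

    // Sum of per-disk throughputs (MB/s) for one datanode.
    public static double sum(List<Double> disks) {
        return disks.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Average of per-disk throughputs (MB/s) for one datanode; 0 if empty.
    public static double average(List<Double> disks) {
        return disks.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // "Current" model from the description: a serial read sees the average
    // datanode, and each datanode delivers the average of its disks.
    public static double currentSerialThroughput(List<List<Double>> datanodes) {
        return datanodes.stream().mapToDouble(SerialReadModel::average).average().orElse(0.0);
    }

    // "Ideal" model from the description: sum over datanodes of the sum of
    // their disks.
    public static double idealSerialThroughput(List<List<Double>> datanodes) {
        return datanodes.stream().mapToDouble(SerialReadModel::sum).sum();
    }

    // The issue title: Math.min(ideal client performance, ideal serial HDFS
    // performance).
    public static double targetStreamingThroughput(double clientMBps,
                                                   List<List<Double>> datanodes) {
        return Math.min(clientMBps, idealSerialThroughput(datanodes));
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 3 datanodes with 2 disks each at 60 MB/s.
        List<List<Double>> cluster = List.of(
                List.of(60.0, 60.0),
                List.of(60.0, 60.0),
                List.of(60.0, 60.0));

        System.out.println(currentSerialThroughput(cluster));          // 60.0
        System.out.println(idealSerialThroughput(cluster));            // 360.0
        System.out.println(targetStreamingThroughput(200.0, cluster)); // 200.0
    }
}
```

With these made-up numbers the model shows a serial reader at single-disk 
speed today, while the ideal would be capped by the client rather than by 
any one disk or datanode.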

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
