[ 
https://issues.apache.org/jira/browse/HDFS-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896997#action_12896997
 ] 

Hong Tang commented on HDFS-1338:
---------------------------------

I think the goal of TestDFSIO is to benchmark peak HDFS throughput under a 
typical MR usage pattern. This means:
- Files should be replicated.
- Files should be spread across nodes relatively evenly. (Run one map per node 
on the cluster, and have the maps write out data evenly.)
- Locality information should be exposed to the MR framework correctly. (The 
benchmark should just use FileInputFormat instead of writing a side file.)
- The dataset should not fit in the OS buffer cache. (Configure the benchmark 
so that the total amount of data > total RAM.)
- Throughput should be aggregated as a time series, and we should ignore the 
ramp-up and cool-down phases of the execution. (The output of each map should 
be a time series of counters of bytes read so far. The report may then 
calculate the max and average over the middle third of the time series.)
- We should minimize variation from MR scheduling. (Run one wave of maps, and 
increase the block size so that each map runs for at least 20 to 30 seconds.)
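
The time-series reporting could look roughly like the sketch below. It is only 
an illustration, not existing TestDFSIO code: the function name 
mid_third_throughput, the fixed sampling interval, and the choice to truncate 
all series to the shortest one are assumptions made for the example.

```python
from typing import Sequence, Tuple


def mid_third_throughput(
    per_map_bytes: Sequence[Sequence[int]], interval_s: float
) -> Tuple[float, float]:
    """Return (max, average) aggregate throughput in bytes/s over the
    middle third of the run.

    per_map_bytes holds one series per map; each series is the cumulative
    "bytes read so far" counter, sampled every interval_s seconds.
    (Hypothetical input layout, assumed for this sketch.)
    """
    if not per_map_bytes:
        raise ValueError("need at least one map's counter series")
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")

    # Assumption: maps run in a single wave, so truncating every series
    # to the shortest one loses only a few trailing samples.
    n = min(len(series) for series in per_map_bytes)
    if n < 2:
        raise ValueError("need at least two samples per map")

    # Cluster-wide cumulative bytes at each sample point.
    total = [sum(series[i] for series in per_map_bytes) for i in range(n)]

    # Convert cumulative counters into per-interval throughput.
    rates = [(total[i] - total[i - 1]) / interval_s for i in range(1, n)]

    # Drop the first and last third (ramp-up and cool-down).
    third = len(rates) // 3
    mid = rates[third:len(rates) - third]

    return max(mid), sum(mid) / len(mid)
```

For example, with two maps sampled once per second, the counters 
[0, 10, 40, 65, 100, 130, 140] and [0, 5, 35, 60, 95, 125, 130] give 
per-interval aggregate rates of 15, 60, 50, 70, 60, 15 bytes/s; the middle 
third is 50 and 70, so the max is 70 and the average is 60.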

> Improve TestDFSIO
> -----------------
>
>                 Key: HDFS-1338
>                 URL: https://issues.apache.org/jira/browse/HDFS-1338
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Arun C Murthy
>
> Currently the read test in the TestDFSIO benchmark just opens a large side 
> file and measures the read performance. The MR scheduler has no opportunity 
> to do *any* optimization for the TestDFSIO MR application. The side effect of 
> this is that it is *very* hard to do any meaningful analysis of the results 
> of the benchmark, i.e. to check whether node-local, rack-local, or off-switch 
> read performance improved or degraded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.