[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621095#comment-13621095
 ] 

Todd Lipcon commented on MAPREDUCE-5125:
----------------------------------------

Instead, we should allocate a buffer of 1MB or so of random data (larger than the 
typical compression window of LZ-based algorithms) and stripe that buffer into 
the output file. This will make the benchmark results more representative of 
typical workloads where the data being written has already been compressed at 
the file format level.
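
A minimal sketch of the striping idea, assuming a plain java.io.OutputStream
as the target. The class name StripedRandomWriter, its method names, and the
seed parameter are hypothetical and are not taken from TestDFSIO or any patch
on this issue:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;

// Hypothetical helper for illustration only; not part of TestDFSIO.
final class StripedRandomWriter {
    // Roughly 1MB, chosen to exceed the window of typical LZ-based compressors.
    static final int BUFFER_SIZE = 1024 * 1024;

    // Fill a buffer once with pseudo-random bytes. The seed is an assumption
    // made here so that runs are repeatable.
    static byte[] randomBuffer(int size, long seed) {
        byte[] buf = new byte[size];
        new Random(seed).nextBytes(buf);
        return buf;
    }

    // Write totalBytes to out by repeating ("striping") the same buffer,
    // truncating the final copy if totalBytes is not a multiple of its length.
    static void writeStriped(OutputStream out, byte[] buf, long totalBytes)
            throws IOException {
        if (buf.length == 0) {
            throw new IllegalArgumentException("buffer must not be empty");
        }
        long remaining = totalBytes;
        while (remaining > 0) {
            int n = (int) Math.min(buf.length, remaining);
            out.write(buf, 0, n);
            remaining -= n;
        }
    }
}
```

Generating the random bytes once and reusing them keeps the cost of random
number generation out of the measured write path, while the repeat distance
of about 1MB should keep a compressor with a smaller window from finding the
duplicates.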
                
> TestDFSIO should write less compressible data
> ---------------------------------------------
>
>                 Key: MAPREDUCE-5125
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5125
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.3-alpha, 1.1.2
>            Reporter: Todd Lipcon
>            Priority: Minor
>
> Currently, TestDFSIO writes a short repeating string of sequential (byte)0 
> through (byte)50. This makes its output very compressible (I measured 250:1 
> by LZOing the resulting file). This makes the results of TestDFSIO very hard 
> to compare when running on HDFS vs other file systems which may include some 
> compression on the network, disk, or both -- what is ostensibly a benchmark 
> of IO throughput yields completely skewed results towards the system with 
> compression.
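
The compressibility gap described in the issue can be checked roughly with
the sketch below. It uses java.util.zip.Deflater rather than LZO (the
measurement above was made with LZO, so the ratios will differ), and the
sequential pattern is only an approximation of the description. The class
name CompressibilityCheck and its methods are hypothetical:

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical helper for illustration only; uses DEFLATE, not LZO.
final class CompressibilityCheck {
    // Approximates the described pattern: bytes 0 through 50, repeated.
    static byte[] sequentialPattern(int size) {
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++) {
            data[i] = (byte) (i % 51);
        }
        return data;
    }

    // Pseudo-random bytes; the seed is an assumption for repeatability.
    static byte[] randomData(int size, long seed) {
        byte[] data = new byte[size];
        new Random(seed).nextBytes(data);
        return data;
    }

    // Returns uncompressed size divided by DEFLATE-compressed size.
    static double compressionRatio(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[64 * 1024];
        long compressed = 0;
        while (!deflater.finished()) {
            compressed += deflater.deflate(chunk);
        }
        deflater.end();
        return (double) input.length / compressed;
    }
}
```

Comparing compressionRatio(sequentialPattern(n)) against
compressionRatio(randomData(n, seed)) for the same n gives a quick sense of
how much a compressing file system could shrink each kind of payload.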
