[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874815#action_12874815
 ] 

Ramkumar Vadali commented on MAPREDUCE-1838:
--------------------------------------------

I had considered sorting the files in decreasing order of size before writing 
them to the sequence file. The idea was to assign these files in a round-robin 
manner to each split. But splits are required to be contiguous portions of the 
split file, and it looks like we don't know the number of splits when generating 
the split file (is it the same as the number set in jobconf.setNumMapTasks?).
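
To make the round-robin idea concrete, here is a rough sketch of how the file 
list could be ordered so that contiguous chunks still get a round-robin share. 
It assumes the number of splits is known up front and that each split takes an 
equal-length run of entries, neither of which is confirmed for DistRaid. The 
class and method names are hypothetical, not part of the raid code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of DistRaid.
class RoundRobinBySize {
    // Reorders file sizes so that cutting the result into numSplits
    // contiguous chunks gives each chunk a round-robin share of the
    // size-sorted list. Assumes numSplits is known, which may not hold.
    static List<Long> order(List<Long> sizes, int numSplits) {
        List<Long> sorted = new ArrayList<>(sizes);
        sorted.sort(Comparator.reverseOrder()); // largest first

        List<List<Long>> bins = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            bins.add(new ArrayList<>());
        }
        // Deal the sorted sizes out to the bins one at a time.
        for (int i = 0; i < sorted.size(); i++) {
            bins.get(i % numSplits).add(sorted.get(i));
        }
        // Concatenate the bins so each one is a contiguous run.
        List<Long> out = new ArrayList<>();
        for (List<Long> bin : bins) {
            out.addAll(bin);
        }
        return out;
    }
}
```

With sizes 10, 1, 8, 3, 6, 5 and two splits, the first chunk would hold 10, 6, 3 
and the second 8, 5, 1, so both chunks mix large and small files.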

One option is to simply shuffle the list of files before writing to the split 
file. That should help reduce the variance in map task run times.
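
A minimal sketch of the shuffle, assuming the file list is available as an 
in-memory List before the split file is written. The class name, method name 
and seed parameter are hypothetical and only there to make the example 
reproducible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of DistRaid.
class ShuffleBeforeWrite {
    // Returns a shuffled copy of the file list, leaving the input untouched.
    // After shuffling, files of similar size are no longer adjacent, so any
    // contiguous chunk of the list should see a mix of sizes.
    static <T> List<T> shuffled(List<T> files, long seed) {
        List<T> copy = new ArrayList<>(files);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```

This does not need the number of splits, which is its main appeal over the 
sorted round-robin idea, but it only balances the splits on average.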

> DistRaid map tasks have large variance in running times
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-1838
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1838
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/raid
>    Affects Versions: 0.20.1
>            Reporter: Ramkumar Vadali
>            Priority: Minor
>
> HDFS RAID uses map-reduce jobs to generate parity files for a set of source 
> files. Each map task gets a subset of files to operate on. The current code 
> assigns files by walking through the list of files given in the constructor 
> of DistRaid.
> The problem is that the list of files given to the constructor has the order 
> of (pretty much) the directory listing. When a large number of files is 
> added, files in that order tend to have the same size. Thus a map task can 
> end up with large files whereas another can end up with small files, 
> increasing the variance in run times.
> We could do smarter assignment by using the file sizes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
