[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914286#action_12914286
 ] 

Ramkumar Vadali commented on MAPREDUCE-1819:
--------------------------------------------

The intention of this task is to achieve concurrency between 
RaidNode.selectFiles() and the submission of Raid MR jobs. The solution 
originally proposed is narrow in scope: it only helps when several policies 
are specified, so that an MR job for one policy can run while selectFiles() 
is running for the others.

But relying on there being many policies is not a good design. Instead, I 
propose the following:

1. For each policy, start a directory traversal as specified by the policy. 
When a certain number of files have been selected for raiding, save the 
traversal state, submit an MR job, and move on to the next policy.
2. After looping through all the policies, submit another job for a policy 
only if there are not enough jobs already running for it.

The limits on the number of files per job and the number of running jobs per 
policy will need to be configurable. With this design, even if a single policy 
covers a large number of files, we can still have sufficient concurrency 
between selectFiles() and running jobs.
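
The loop described above could be sketched roughly as follows. This is only 
an illustration under stated assumptions, not the patch: the class 
IncrementalRaidScheduler, its methods, and the two limit parameters are all 
hypothetical names, a plain iterator over a file list stands in for the saved 
directory-traversal state, and recording a batch in a list stands in for 
submitting an MR job.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the RaidNode code. */
class IncrementalRaidScheduler {
    private final int maxFilesPerJob;    // assumed configurable limit
    private final int maxJobsPerPolicy;  // assumed configurable limit

    // Saved traversal state per policy: an iterator resumes where it stopped.
    private final Map<String, Iterator<String>> traversals = new LinkedHashMap<>();
    // Number of jobs currently running for each policy.
    private final Map<String, Integer> runningJobs = new LinkedHashMap<>();
    // Submitted batches; a stand-in for real MR job submission.
    final List<List<String>> submitted = new ArrayList<>();

    IncrementalRaidScheduler(int maxFilesPerJob, int maxJobsPerPolicy) {
        this.maxFilesPerJob = maxFilesPerJob;
        this.maxJobsPerPolicy = maxJobsPerPolicy;
    }

    void addPolicy(String policy, List<String> candidateFiles) {
        traversals.put(policy, candidateFiles.iterator());
        runningJobs.put(policy, 0);
    }

    /** One pass over all policies; returns the number of jobs submitted. */
    int runOnePass() {
        int jobs = 0;
        for (Map.Entry<String, Iterator<String>> entry : traversals.entrySet()) {
            String policy = entry.getKey();
            // Skip a policy that already has enough jobs running.
            if (runningJobs.get(policy) >= maxJobsPerPolicy) {
                continue;
            }
            // Resume the traversal and stop once the file limit is reached.
            Iterator<String> traversal = entry.getValue();
            List<String> batch = new ArrayList<>();
            while (traversal.hasNext() && batch.size() < maxFilesPerJob) {
                batch.add(traversal.next());
            }
            if (batch.isEmpty()) {
                continue;  // nothing left to raid for this policy
            }
            submitted.add(batch);
            runningJobs.put(policy, runningJobs.get(policy) + 1);
            jobs++;
        }
        return jobs;
    }

    /** Called when a job for the given policy completes. */
    void jobFinished(String policy) {
        runningJobs.put(policy, runningJobs.get(policy) - 1);
    }
}
```

In this sketch a single policy with many files still yields a stream of small 
jobs: each pass submits at most one batch per policy, and the traversal picks 
up from its saved position on the next pass once a running job has finished.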

I will submit a patch with this design.

> RaidNode should be smarter in submitting Raid jobs
> --------------------------------------------------
>
>                 Key: MAPREDUCE-1819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1819
>             Project: Hadoop Map/Reduce
>          Issue Type: Task
>          Components: contrib/raid
>    Affects Versions: 0.20.1
>            Reporter: Ramkumar Vadali
>
> The RaidNode currently computes parity files as follows:
> 1. Using RaidNode.selectFiles() to figure out what files to raid for a policy
> 2. Using #1 repeatedly for each configured policy to accumulate a list of 
> files. 
> 3. Submitting a mapreduce job with the list of files from #2 using 
> DistRaid.doDistRaid()
> This task addresses the fact that #2 and #3 happen sequentially. The proposal 
> is to submit a separate mapreduce job for the list of files for each policy 
> and use another thread to track the progress of the submitted jobs. This will 
> help reduce the time taken for files to be raided.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
