[ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719350#action_12719350 ]

Philip Zeyliger commented on HADOOP-6039:
-----------------------------------------

The motivation behind computing the input splits on the cluster is at least 
two-fold:
 * It would be great to be able to submit jobs to a cluster through a simple 
(REST?) API, from many languages.  (Similar to HADOOP-5633.)  Because job 
submission currently does a good deal of mapreduce-internal work, building 
such an API is tricky.  We're already seeing workflow systems (here I'm 
thinking of Oozie and Pig) run MR jobs simply to launch more MR jobs, 
inheriting the scheduling and isolation work that the JobTracker already does.
 * Sometimes computing the input splits is, in and of itself, an operation 
that would benefit from running in parallel across several machines.  For 
example, computing splits may require listing and examining many files on the 
DFS.  Moving the input split calculation onto the cluster would pave the way 
for parallelizing it.
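To make the parallelism opportunity concrete, here is a toy sketch (not Hadoop code; the FileStat/Split types and the 64 MB block size are assumptions for illustration): splitting one file is independent of splitting any other, so the per-file work can be fanned out across a thread pool, or, in the proposal here, across cluster tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-in for split calculation; names are hypothetical.
public class ParallelSplitSketch {
    record FileStat(String path, long length) {}
    record Split(String path, long offset, long length) {}

    static final long BLOCK = 64L * 1024 * 1024;  // assumed block size

    // Each file's splits are computed independently, so the per-file work
    // is submitted to a pool and the results are gathered at the end.
    static List<Split> computeSplits(List<FileStat> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<Split>>> futures = new ArrayList<>();
            for (FileStat f : files) {
                futures.add(pool.submit(() -> splitOneFile(f)));
            }
            List<Split> all = new ArrayList<>();
            for (Future<List<Split>> fut : futures) {
                all.addAll(fut.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    // Chop one file into block-sized splits.
    static List<Split> splitOneFile(FileStat f) {
        List<Split> splits = new ArrayList<>();
        for (long off = 0; off < f.length(); off += BLOCK) {
            splits.add(new Split(f.path(), off,
                                 Math.min(BLOCK, f.length() - off)));
        }
        return splits;
    }

    public static void main(String[] args) throws Exception {
        List<FileStat> files = List.of(
            new FileStat("/data/a", 100L * 1024 * 1024),  // 2 splits
            new FileStat("/data/b", 10L * 1024 * 1024));  // 1 split
        System.out.println(computeSplits(files).size());  // prints 3
    }
}
```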

Implementation-wise, we already have JOB_SETUP and JOB_CLEANUP tasks, so adding 
a JOB_SPLIT_CALCULATION task, which could be colocated with JOB_SETUP, makes 
some sense.
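A minimal sketch of the idea, not actual Hadoop source: the existing setup/cleanup task kinds plus the proposed split-calculation kind, with the split calculation reusing whatever placement the JobTracker already chooses for JOB_SETUP (the enum and method names below are assumptions for illustration).

```java
// Hypothetical task-kind model; only JOB_SETUP and JOB_CLEANUP exist today.
public class TaskKindSketch {
    enum TaskKind { JOB_SETUP, JOB_SPLIT_CALCULATION, MAP, REDUCE, JOB_CLEANUP }

    // The split calculation could piggyback on the same host/slot choice the
    // JobTracker already makes for JOB_SETUP, so no new placement logic is needed.
    static boolean colocatedWithSetup(TaskKind k) {
        return k == TaskKind.JOB_SETUP || k == TaskKind.JOB_SPLIT_CALCULATION;
    }

    public static void main(String[] args) {
        System.out.println(colocatedWithSetup(TaskKind.JOB_SPLIT_CALCULATION));
        // prints true
    }
}
```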

> Computing Input Splits on the MR Cluster
> ----------------------------------------
>
>                 Key: HADOOP-6039
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6039
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Philip Zeyliger
>
> Instead of computing the input splits as part of job submission, Hadoop could 
> have a separate "job task type" that computes the input splits, therefore 
> allowing that computation to happen on the cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
