[ https://issues.apache.org/jira/browse/HADOOP-6039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719350#action_12719350 ]
Philip Zeyliger commented on HADOOP-6039:
-----------------------------------------

The motivation behind computing the input splits on the cluster is at least two-fold:

* It would be great to be able to submit jobs to a cluster using a simple (REST?) API, from many languages (similar to HADOOP-5633). The fact that job submission does a bunch of mapreduce-internal work makes such submission very tricky. We're already seeing workflow systems (here I'm thinking of Oozie and Pig) run MR jobs simply to launch more MR jobs, while inheriting the scheduling and isolation work that the JobTracker already does.
* Sometimes computing the input splits is, in and of itself, an operation that would do well to run in parallel across several machines. For example, splitting inputs may require going through many files on the DFS. Moving input-split calculation onto the cluster would pave the way for this to be possible.

Implementation-wise, we already have JOB_SETUP and JOB_CLEANUP tasks, so adding a JOB_SPLIT_CALCULATION task, which could be colocated with JOB_SETUP, makes some sense.

> Computing Input Splits on the MR Cluster
> ----------------------------------------
>
>                 Key: HADOOP-6039
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6039
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Philip Zeyliger
>
> Instead of computing the input splits as part of job submission, Hadoop could
> have a separate "job task type" that computes the input splits, therefore
> allowing that computation to happen on the cluster.
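To make the second bullet concrete: the per-file work that `InputFormat.getSplits()` does today (walking files, carving each into block-sized ranges) is embarrassingly parallel across files. The toy below is a stdlib-only sketch, not Hadoop code; `FileStatus`, `Split`, `splitsFor`, and the 64 MB block size are stand-ins for the real `org.apache.hadoop.mapred` types, and the thread pool stands in for fanning the work out across cluster tasks.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelSplitSketch {
    // Toy stand-in for Hadoop's FileStatus: a path plus a length in bytes.
    record FileStatus(String name, long length) {}
    // Toy stand-in for an InputSplit: a byte range within one file.
    record Split(String file, long offset, long length) {}

    static final long BLOCK = 64L * 1024 * 1024; // classic 64 MB HDFS block size

    // Per-file work: carve one file into block-sized ranges. Today this
    // runs serially, file by file, inside the client at submission time.
    static List<Split> splitsFor(FileStatus f) {
        List<Split> out = new ArrayList<>();
        for (long off = 0; off < f.length(); off += BLOCK) {
            out.add(new Split(f.name(), off, Math.min(BLOCK, f.length() - off)));
        }
        return out;
    }

    // Fan the per-file work out across a thread pool -- standing in for
    // running it as tasks on the cluster -- then gather the results.
    static List<Split> computeSplits(List<FileStatus> files, int parallelism)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<Split>>> futures = new ArrayList<>();
            for (FileStatus f : files) {
                futures.add(pool.submit(() -> splitsFor(f)));
            }
            List<Split> all = new ArrayList<>();
            for (Future<List<Split>> fut : futures) {
                all.addAll(fut.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<FileStatus> files = List.of(
                new FileStatus("/data/a", 150L * 1024 * 1024), // 3 splits
                new FileStatus("/data/b", 64L * 1024 * 1024),  // 1 split
                new FileStatus("/data/c", 1L));                // 1 split
        System.out.println(computeSplits(files, 4).size());    // prints 5
    }
}
```

The same shape transfers to the proposed JOB_SPLIT_CALCULATION task: each task computes splits for a subset of the input, and the JobTracker gathers the results before scheduling map tasks.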