[ 
https://issues.apache.org/jira/browse/CRUNCH-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442864#comment-13442864
 ] 

Josh Wills commented on CRUNCH-50:
----------------------------------

Love it-- we'll integrate the test when we integrate CRUNCH-34. Thanks Ryan!
                
> Map-only parallelDos with multiple outputs should be fused
> ----------------------------------------------------------
>
>                 Key: CRUNCH-50
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-50
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>         Attachments: MULTIOUTPUT_FUSE_TEST.patch
>
>
> I'm not sure if this is a bug or just an optimization yet to be implemented, 
> but sibling parallelDo operations that have separate outputs are not being 
> fused into a single Map operation.  (The original FlumeJava paper suggests 
> they should be, and it would be a nice optimization to have here as well.)  
> Instead, each parallelDo results in a separate Map-only job that 
> (redundantly) scans the input source of data.
> This can be seen in the current MultipleOutputIT integration test.  Notice 
> the logs below from running one of those tests scans the same input in 
> multiple jobs.
> 8414 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running 
> job "org.apache.crunch.MultipleOutputIT: 
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+even+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/even)"
> 8415 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
> available at: http://localhost:8080/
> 8417 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - Use 
> GenericOptionsParser for parsing the arguments. Applications should implement 
> Tool for the same.
> 8497 [Thread-38] WARN  org.apache.hadoop.mapred.JobClient  - No job jar file 
> set.  User classes may not be found. See JobConf(Class) or 
> JobConf#setJar(String).
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Running 
> job "org.apache.crunch.MultipleOutputIT: 
> Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/letters.txt)+odd+asText+Text(/var/folders/jd/4yr3f9m15kn7mz7h3gz3ysb40000gp/T/junit892676812962236999/odd)"
> 8532 [Thread-38] INFO  org.apache.crunch.impl.mr.exec.CrunchJob  - Job status 
> available at: http://localhost:8080/
> I was going to take a stab at a patch for this, but noticed some major 
> refactoring in this space is on deck as part of CRUNCH-34...so it might be 
> best to address this after CRUNCH-34 lands.
> As an aside, it wasn't clear how to write a good integration test to expose 
> this functionality.  Would simply counting the stage results and ensuring we 
> have the expected number for a simple job be the best way?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to