[ https://issues.apache.org/jira/browse/CRUNCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-509:
------------------------------
    Attachment: CRUNCH-509b.patch

Got a version of this to work, but it's interesting in a couple of ways.

First, I had to eliminate some _seriously_ legacy bits of Crunch's AvroOutputFormat that were written in the days before multiple outputs were really supported well and that were causing the PageRank-related test failures we were getting when running these tests. I felt a little weird doing it, but removing those bits broke no tests, and the approach of a different named schema param for each Avro output was outmoded anyway.

Second, I'm basically passing in a known "name" value for each output to the configureForMapReduce function, and then immediately pulling out all of its output config info and using it to configure the Job I create for Spark. Since Spark only writes one output at a time, this works fine, even though it looks hacky.

I think it would be interesting to try creating Spark pipelines that had something closer to "real" support for multiple outputs, but I think that will take some substantial work, and I can live with this for now.

[~mkwhitacre] and [~gabriel.reid], thoughts on this approach are welcome.

> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>
>                 Key: CRUNCH-509
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-509
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-509.patch, CRUNCH-509b.patch
>
>
> Crunch currently does not "name" all outputs when running with a
> SparkPipeline. This becomes a problem as some Targets (based on CRUNCH-82)
> have coded in checks to ensure that the name is populated.
> Specifically, the implementation I'm running into issues with is the Kite
> DatasetTarget[2].
> Need to read up a bit on context to see if it is a Crunch/Kite issue or where
> it is easiest/correct to fix. [~jwills] or [~tomwhite] feedback would be
> welcome.
> [1] -
> https://github.com/apache/crunch/blob/3ab0b078c47f23b3ba893fdfb05fd723f663d02b/crunch-spark/src/main/java/org/apache/crunch/impl/spark/SparkRuntime.java#L337
> [2] -
> https://github.com/kite-sdk/kite/blob/e080f0237e7383a16fff8547ad43387ccf55c473/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L178

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)