[ https://issues.apache.org/jira/browse/CRUNCH-509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Micah Whitacre updated CRUNCH-509:
----------------------------------
    Attachment: CRUNCH-509.patch

Still working on a solution for this. The change to add name support is pretty simple. The downstream effect, however, is that all calls to materialize the output (which is what we do in the IT for Spark) fail because it cannot find the files.

{noformat}
4500 [Thread-29] INFO org.apache.spark.scheduler.DAGScheduler - Job 0 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.874098 s
15/04/08 20:57:48 INFO DAGScheduler: Job 0 finished: saveAsNewAPIHadoopFile at SparkRuntime.java:332, took 0.874098 s
4573 [main] INFO org.apache.crunch.io.avro.AvroFileReaderFactory - Could not read avro file at path: file:/tmp/crunch-109470525/p1/part-r-00000
java.io.IOException: Not a data file.
	at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
	at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
	at org.apache.crunch.io.avro.AvroFileReaderFactory.read(AvroFileReaderFactory.java:74)
	at org.apache.crunch.io.CompositePathIterable$2.<init>(CompositePathIterable.java:87)
	at org.apache.crunch.io.CompositePathIterable.iterator(CompositePathIterable.java:85)
	at com.google.common.collect.Iterables$3.next(Iterables.java:512)
	at com.google.common.collect.Iterables$3.next(Iterables.java:505)
	at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
	at org.apache.crunch.materialize.pobject.FirstElementPObject.process(FirstElementPObject.java:45)
	at org.apache.crunch.materialize.pobject.PObjectImpl.getValue(PObjectImpl.java:71)
	at org.apache.crunch.SparkPageRankIT.run(SparkPageRankIT.java:156)
	at org.apache.crunch.SparkPageRankIT.testAvroReflects(SparkPageRankIT.java:97)
{noformat}

One of the behavior changes I noticed is that when run without a name, the job produces files named part-r-00000.avro. When we add the name, we now get files without the file extension.
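As background on the "Not a data file." error in the trace: to my knowledge, Avro raises it when a file does not begin with the four-byte container magic ('O', 'b', 'j', 1). The sketch below is a hypothetical stand-alone diagnostic (the class name AvroMagicCheck and its methods are made up for illustration and are not part of Crunch or Avro). It uses only the JDK and assumes the output sits on the local filesystem, as in the /tmp path above. It reports which entries in an output directory start with that header, regardless of file extension.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical diagnostic helper; not part of Crunch or Avro. */
final class AvroMagicCheck {
    // Avro object container files begin with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the given leading bytes start with the Avro magic header. */
    static boolean hasAvroMagic(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    /** Reads the first four bytes of a file and checks them against the magic header. */
    static boolean hasAvroMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int filled = 0;
            while (filled < head.length) {
                int read = in.read(head, filled, head.length - filled);
                if (read < 0) {
                    break; // file is shorter than the header
                }
                filled += read;
            }
            return filled == head.length && hasAvroMagic(head);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AvroMagicCheck <output-directory>");
            return;
        }
        Path dir = Paths.get(args[0]);
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (!Files.isRegularFile(entry)) {
                    // Subdirectories are reported, not descended into.
                    System.out.println(entry + " -> not a regular file");
                    continue;
                }
                System.out.println(entry + " -> avro magic: " + hasAvroMagic(entry));
            }
        }
    }
}
```

Running it against the temporary output directory from a failing run would show whether the extension-less part file is a real Avro container or something else, such as a directory or a file in another format.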
I believe this might be related to the files not being detected as containing data, but I haven't found where in the code that extension might be getting dropped.

> Crunch with Spark doesn't name all outputs
> ------------------------------------------
>
>                 Key: CRUNCH-509
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-509
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Micah Whitacre
>            Assignee: Josh Wills
>             Fix For: 0.12.0
>
>         Attachments: CRUNCH-509.patch
>
>
> Crunch currently does not "name" all outputs when running with a
> SparkPipeline. This becomes a problem as some Targets (based on CRUNCH-82)
> have coded in checks to ensure that the name must be populated.
> Specifically, the implementation I'm running into issues with is the Kite
> DatasetTarget[2].
> Need to read up a bit on context to see if it is a Crunch/Kite issue or where
> it is easiest/correct to fix. [~jwills] or [~tomwhite] feedback would be
> welcome.
> [1] -
> https://github.com/apache/crunch/blob/3ab0b078c47f23b3ba893fdfb05fd723f663d02b/crunch-spark/src/main/java/org/apache/crunch/impl/spark/SparkRuntime.java#L337
> [2] -
> https://github.com/kite-sdk/kite/blob/e080f0237e7383a16fff8547ad43387ccf55c473/kite-data/kite-data-crunch/src/main/java/org/kitesdk/data/crunch/DatasetTarget.java#L178

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)