[
https://issues.apache.org/jira/browse/CRUNCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills updated CRUNCH-597:
------------------------------
Attachment: CRUNCH-597.patch
Attaching a patch for this, which ended up being a bit more work than I had
initially hoped. Barring any objections to the upgrade (I think it's fair to
say that it's time, and that it's okay to do this for 0.14.0, but LMK if I'm
wrong about that), I'll commit this tomorrow.
> Unable to process parquet files using Hadoop
> --------------------------------------------
>
> Key: CRUNCH-597
> URL: https://issues.apache.org/jira/browse/CRUNCH-597
> Project: Crunch
> Issue Type: Bug
> Components: Core, IO
> Affects Versions: 0.13.0
> Reporter: Stan Rosenberg
> Assignee: Josh Wills
> Attachments: CRUNCH-597.patch
>
>
> The current version of parquet-hadoop results in the following stack trace
> while attempting to read from a parquet file.
> {code}
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>         at org.apache.crunch.impl.mr.run.CrunchRecordReader.initialize(CrunchRecordReader.java:140)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Here is the relevant code snippet, which yields the above stack trace when
> executed locally:
> {code}
> Pipeline pipeline = new MRPipeline(Crunch.class, conf);
> PCollection<Pair<String, Observation>> observations =
>     pipeline.read(AvroParquetFileSource.builder(record).build(new Path(args[0])))
>             .parallelDo(new TranslateFn(),
>                 Avros.tableOf(Avros.strings(), Avros.specifics(Observation.class)));
> for (Pair<String, Observation> pair : observations.materialize()) {
>   System.out.println(pair.second());
> }
> PipelineResult result = pipeline.done();
> {code}
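For context on the failure mode: per the stack trace, ParquetRecordReader.initialize unconditionally downcasts the InputSplit it receives to ParquetInputSplit, while CrunchRecordReader hands it a plain FileSplit. A minimal, self-contained sketch of that same cast pattern (the stub class names below are hypothetical stand-ins for the Hadoop/Parquet types, not part of the patch):

```java
// Stub hierarchy standing in for the real types (hypothetical names):
//   InputSplitStub        ~ org.apache.hadoop.mapreduce.InputSplit
//   FileSplitStub         ~ org.apache.hadoop.mapreduce.lib.input.FileSplit
//   ParquetInputSplitStub ~ parquet.hadoop.ParquetInputSplit
class InputSplitStub {}
class FileSplitStub extends InputSplitStub {}
class ParquetInputSplitStub extends FileSplitStub {}

public class CastDemo {
    // Mirrors the failing code path: the reader downcasts whatever split it is handed.
    static void initialize(InputSplitStub split) {
        ParquetInputSplitStub p = (ParquetInputSplitStub) split;
    }

    public static void main(String[] args) {
        initialize(new ParquetInputSplitStub()); // a true ParquetInputSplit: cast succeeds
        try {
            initialize(new FileSplitStub());     // a plain FileSplit: cast throws
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the stack trace above");
        }
    }
}
```

Newer parquet-hadoop releases accept a generic FileSplit in ParquetRecordReader.initialize, which is why the version upgrade addresses this.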
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)