[
https://issues.apache.org/jira/browse/CRUNCH-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Wills updated CRUNCH-597:
------------------------------
Attachment: CRUNCH-597.patch
Attaching a patch for this, which ended up being a bit more work than I had
initially hoped. Barring any objections to the upgrade (I think it's fair to
say that it's time, and that it's okay to do this for 0.14.0, but LMK if I'm
wrong about that), I'll commit this tomorrow.
> Unable to process parquet files using Hadoop
> --------------------------------------------
>
> Key: CRUNCH-597
> URL: https://issues.apache.org/jira/browse/CRUNCH-597
> Project: Crunch
> Issue Type: Bug
> Components: Core, IO
> Affects Versions: 0.13.0
> Reporter: Stan Rosenberg
> Assignee: Josh Wills
> Attachments: CRUNCH-597.patch
>
>
> The current version of parquet-hadoop results in the following stack trace
> while attempting to read from a parquet file.
> {code}
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:406)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit cannot be cast to parquet.hadoop.ParquetInputSplit
>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
>         at org.apache.crunch.impl.mr.run.CrunchRecordReader.initialize(CrunchRecordReader.java:140)
>         at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:478)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:671)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:268)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Here is the relevant code snippet, which yields the above stack trace when
> executed locally:
> {code}
> Pipeline pipeline = new MRPipeline(Crunch.class, conf);
> PCollection<Pair<String, Observation>> observations =
>     pipeline.read(AvroParquetFileSource.builder(record).build(new Path(args[0])))
>             .parallelDo(new TranslateFn(),
>                 Avros.tableOf(Avros.strings(), Avros.specifics(Observation.class)));
> for (Pair<String, Observation> pair : observations.materialize()) {
>   System.out.println(pair.second());
> }
> PipelineResult result = pipeline.done();
> {code}
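For context on the failure mode: per the stack trace, ParquetRecordReader.initialize unconditionally downcasts the InputSplit it receives to ParquetInputSplit, while CrunchRecordReader hands it a plain FileSplit. A minimal, self-contained sketch of that same cast pattern (the stub class names below are hypothetical stand-ins for the Hadoop/Parquet types, not part of the patch):

```java
// Stub hierarchy standing in for the real types (hypothetical names):
//   InputSplitStub        ~ org.apache.hadoop.mapreduce.InputSplit
//   FileSplitStub         ~ org.apache.hadoop.mapreduce.lib.input.FileSplit
//   ParquetInputSplitStub ~ parquet.hadoop.ParquetInputSplit
class InputSplitStub {}
class FileSplitStub extends InputSplitStub {}
class ParquetInputSplitStub extends FileSplitStub {}

public class CastDemo {
    // Mirrors the failing code path: the reader downcasts whatever split it is handed.
    static void initialize(InputSplitStub split) {
        ParquetInputSplitStub p = (ParquetInputSplitStub) split;
    }

    public static void main(String[] args) {
        initialize(new ParquetInputSplitStub()); // a true ParquetInputSplit: cast succeeds
        try {
            initialize(new FileSplitStub());     // a plain FileSplit: cast throws
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in the stack trace above");
        }
    }
}
```

Newer parquet-hadoop releases accept a generic FileSplit in ParquetRecordReader.initialize, which is why the version upgrade addresses this.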
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)