[
https://issues.apache.org/jira/browse/SPARK-19656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15896343#comment-15896343
]
Sean Owen commented on SPARK-19656:
-----------------------------------
It accepts it because you tell it that's what the InputFormat will return, but
it doesn't actually return that. The Class argument is there only to fix the
compile-time type; declaring it doesn't make it so, and Spark has no way of
verifying that it matches what your InputFormat really produces.
newAPIHadoopFile doesn't load the data as anything in particular; the
InputFormat does. You are really dealing with the Hadoop and Avro APIs here,
not with Spark.
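To make the type-erasure point concrete, here is a minimal, Spark-free sketch
(all names are made up for illustration) of how a Class token can pin down a
compile-time type while converting and checking nothing at runtime — the same
shape as the kClass argument of newAPIHadoopFile:

```java
import java.util.HashMap;

public class ClassTokenDemo {
    // Hypothetical loader: the Class token only fixes the compile-time
    // type T. The unchecked cast below is erased, so nothing is verified
    // at the point where the "wrong" object is returned.
    @SuppressWarnings("unchecked")
    static <T> T load(Class<T> declaredType) {
        Object actual = new HashMap<String, Integer>(); // what really comes back
        return (T) actual; // erased cast: no runtime check happens here
    }

    public static void main(String[] args) {
        // Compiles fine, because we *told* the compiler a String comes back.
        try {
            String s = load(String.class); // checkcast at the call site fails
            System.out.println("unreachable: " + s);
        } catch (ClassCastException e) {
            System.out.println("caught ClassCastException");
        }
    }
}
```

The ClassCastException surfaces only at the first use of the declared type,
which is why the reporter's code below compiles cleanly and then fails on the
assignment.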
I'm going to leave the conversation there and close this, as this is as much as
is reasonable to consider in the context of Spark. This is not a bug as-is. You
can take this info to explore how to work with Avro values elsewhere. The JIRA
can be reopened if you have a clear, reproducible discrepancy between what
Spark is supposed to return or do and what it actually does; that does require
understanding how the Hadoop APIs operate. Questions should stay on the mailing
list or Stack Overflow while this is still in the realm of "how can I get this
to work?"
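For the record, one commonly used route to specific records through the new
Hadoop API is Avro's own AvroKeyInputFormat with the reader schema set on the
job configuration. A rough, unverified sketch — it assumes MyCustomClass is an
Avro-generated specific class and that avro-mapred is on the classpath:

```java
Job job = Job.getInstance(sc.hadoopConfiguration());
// Tell Avro which (specific) reader schema to use when decoding records.
AvroJob.setInputKeySchema(job, MyCustomClass.getClassSchema());

@SuppressWarnings("unchecked")
JavaPairRDD<AvroKey<MyCustomClass>, NullWritable> records =
    sc.newAPIHadoopFile("file:/path/to/datafile.avro",
        (Class) AvroKeyInputFormat.class, (Class) AvroKey.class,
        NullWritable.class, job.getConfiguration());

MyCustomClass first = records.first()._1.datum();
```

Whether datum() actually comes back as the specific class still depends on the
Avro runtime resolving the generated class; if it can't, you get a
GenericData$Record, which is exactly the symptom reported in the issue.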
> Can't load custom type from avro file to RDD with newAPIHadoopFile
> ------------------------------------------------------------------
>
> Key: SPARK-19656
> URL: https://issues.apache.org/jira/browse/SPARK-19656
> Project: Spark
> Issue Type: Question
> Components: Java API
> Affects Versions: 2.0.2
> Reporter: Nira Amit
>
> If I understand correctly, in Scala it's possible to load custom objects from
> avro files into RDDs this way:
> {code}
> ctx.hadoopFile("/path/to/the/avro/file.avro",
>   classOf[AvroInputFormat[MyClassInAvroFile]],
>   classOf[AvroWrapper[MyClassInAvroFile]],
>   classOf[NullWritable])
> {code}
> I'm not a Scala developer, so I tried to "translate" this to Java as best I
> could. I created classes that extend AvroKey and FileInputFormat:
> {code}
> public static class MyCustomAvroKey extends AvroKey<MyCustomClass> {}
>
> public static class MyCustomAvroReader extends
>     AvroRecordReaderBase<MyCustomAvroKey, NullWritable, MyCustomClass> {
>   // with my custom schema and all the required methods...
> }
>
> public static class MyCustomInputFormat extends
>     FileInputFormat<MyCustomAvroKey, NullWritable> {
>   @Override
>   public RecordReader<MyCustomAvroKey, NullWritable> createRecordReader(
>       InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
>       throws IOException, InterruptedException {
>     return new MyCustomAvroReader();
>   }
> }
>
> ...
>
> JavaPairRDD<MyCustomAvroKey, NullWritable> records =
>     sc.newAPIHadoopFile("file:/path/to/datafile.avro",
>         MyCustomInputFormat.class, MyCustomAvroKey.class,
>         NullWritable.class, sc.hadoopConfiguration());
>
> MyCustomClass first = records.first()._1.datum();
> System.out.println("Got a result, some custom field: " +
>     first.getSomeCustomField());
> {code}
> This compiles fine, but using a debugger I can see that `first._1.datum()`
> actually returns a `GenericData$Record` at runtime, not a `MyCustomClass`
> instance.
> And indeed, when the following line executes:
> {code}
> MyCustomClass first = records.first()._1.datum();
> {code}
> I get an exception:
> {code}
> java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record
> cannot be cast to my.package.containing.MyCustomClass
> {code}
> Am I doing it wrong? Or is this not possible in Java?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)