Ryan Berti created BEAM-8953:
--------------------------------
Summary: Extend ParquetIO.Read/ReadFiles.Builder to support Avro
GenericData model
Key: BEAM-8953
URL: https://issues.apache.org/jira/browse/BEAM-8953
Project: Beam
Issue Type: Improvement
Components: examples-java
Affects Versions: 2.16.0
Reporter: Ryan Berti

When using ParquetIO to read Parquet data into Scala case classes, we'd like
to use a downstream converter that takes GenericRecords and converts them to
instances of our case classes, rather than relying on ParquetIO to deserialize
directly into the case class via reflection and an implementation of the
IndexedRecord interface.
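
As a rough illustration of the downstream-converter idea, here is a minimal
sketch in Java. To keep it dependency-free, a Map<String, Object> stands in for
Avro's GenericRecord (which offers a similar get(String) lookup). The User
class, the toUser method, and the field names are hypothetical and only serve
the example. They are not part of ParquetIO or of our actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class GenericToTypedConverter {

    /** Hypothetical stand-in for a Scala case class such as: case class User(name: String, age: Int). */
    public static final class User {
        public final String name;
        public final int age;

        public User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /**
     * Builds a User by looking fields up by name; no reflection is involved.
     * The Map parameter is an assumed stand-in for Avro's GenericRecord.
     */
    public static User toUser(Map<String, Object> record) {
        Object name = record.get("name");
        Object age = record.get("age");
        if (name == null || age == null) {
            throw new IllegalArgumentException("record is missing 'name' or 'age': " + record);
        }
        // toString() because Avro usually returns strings as a CharSequence (Utf8),
        // not as java.lang.String; Number covers both Integer and Long values.
        return new User(name.toString(), ((Number) age).intValue());
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put("name", "Ada");
        record.put("age", 36);

        User user = toUser(record);
        System.out.println(user.name + " " + user.age); // prints "Ada 36"
    }
}
```

The point of this shape is that the target class needs no Avro-specific
interface: the converter alone knows how record fields map to constructor
arguments.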

The ParquetIO.Read and ParquetIO.ReadFiles builders currently accept a
filepattern plus a schema, and a schema alone, respectively. With only these
arguments, the underlying AvroParquetReader created in
ParquetIO.ReadFiles.ReadFn defaults to an AvroReadSupport instance whose
GenericData model is set to SpecificData. We'd like the underlying
AvroReadSupport to use the GenericData model instead, but the existing
ParquetIO Read / ReadFiles builders offer no way to request this.

I'd like to extend the ParquetIO Read / ReadFiles builders with a new method
that lets users specify a GenericData model, which is then passed to the
AvroParquetReader builder. I've tested and validated that this change lets
ParquetIO produce GenericRecord instances without requiring that users'
classes can be reflectively instantiated and initialized via the
IndexedRecord interface.