[ 
https://issues.apache.org/jira/browse/BEAM-8953?focusedWorklogId=365496&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-365496
 ]

ASF GitHub Bot logged work on BEAM-8953:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Jan/20 21:09
            Start Date: 02/Jan/20 21:09
    Worklog Time Spent: 10m 
      Work Description: RyanBerti commented on pull request #10360: [BEAM-8953] 
Extend ParquetIO read builders for AvroParquetReader
URL: https://github.com/apache/beam/pull/10360#discussion_r362630680
 
 

 ##########
 File path: 
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
 ##########
 @@ -144,6 +166,9 @@ public static ReadFiles readFiles(Schema schema) {
     @Nullable
     abstract Schema getSchema();
 
+    @Nullable
+    abstract GenericData getAvroDataModel();
 
 Review comment:
   @aromanenko-dev sorry, I didn't see this comment. As I pointed out, 
GenericData would join Schema as an instance variable of the 
AutoValue-annotated ReadFiles class (neither of which is serializable). From 
the docs, only the ParDo itself must be serializable 
(https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ParDo.java#L282).
 When the ParDo is created anonymously, the enclosing PTransform needs to be 
serializable as well, but that's not the case with ParquetIO. I've run the 
changes both locally via the DirectRunner and on Cloud Dataflow without issue. 
I was asking whether there is a way to unit test the serialization 
requirements, but that doesn't seem to be an option. Let me know if building 
out serialization for all of the components of the PTransform is required, or 
if we can move ahead without serializing the Schema and GenericData instance 
vars. 
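
The distinction drawn above, between a function created anonymously and one that does not reference its enclosing object, can be sketched in plain Java without Beam. Every name below (NonSerializableModel, Transform, Fn, canSerialize) is a hypothetical stand-in chosen for illustration and is not a Beam or ParquetIO type. The sketch only shows standard Java serialization behaviour, under the assumption that the function object is what gets serialized.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSketch {
    // Hypothetical stand-in for a non-serializable member (e.g. a schema or data model).
    static class NonSerializableModel {
        final String name;
        NonSerializableModel(String name) { this.name = name; }
    }

    // Hypothetical stand-in for a serializable per-element function.
    interface Fn extends Serializable {
        String apply(String in);
    }

    // Hypothetical stand-in for a transform that is itself NOT serializable.
    static class Transform {
        private final NonSerializableModel model;
        Transform(NonSerializableModel model) { this.model = model; }

        // Static nested class: holds only the serializable data it needs,
        // and no reference to the enclosing Transform instance.
        static class StaticFn implements Fn {
            private final String modelName;
            StaticFn(String modelName) { this.modelName = modelName; }
            @Override public String apply(String in) { return modelName + ":" + in; }
        }

        Fn staticFn() { return new StaticFn(model.name); }

        // Anonymous class: reading the outer field captures the enclosing
        // Transform, so serializing it also tries to serialize the Transform.
        Fn anonymousFn() {
            return new Fn() {
                @Override public String apply(String in) { return model.name + ":" + in; }
            };
        }
    }

    // Returns true if Java serialization of the object succeeds.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Transform t = new Transform(new NonSerializableModel("generic"));
        System.out.println(canSerialize(t.staticFn()));    // true
        System.out.println(canSerialize(t.anonymousFn())); // false
    }
}
```

A helper like canSerialize is one way such a requirement could be exercised in a unit test; whether that is appropriate for ParquetIO is a separate question.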
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 365496)
    Time Spent: 2h 20m  (was: 2h 10m)

> Extend ParquetIO.Read/ReadFiles.Builder to support Avro GenericData model
> -------------------------------------------------------------------------
>
>                 Key: BEAM-8953
>                 URL: https://issues.apache.org/jira/browse/BEAM-8953
>             Project: Beam
>          Issue Type: Improvement
>          Components: examples-java
>    Affects Versions: 2.16.0
>            Reporter: Ryan Berti
>            Assignee: Ryan Berti
>            Priority: Minor
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> When utilizing ParquetIO to deserialize objects into case classes in Scala, 
> we'd like to utilize a downstream converter which takes GenericRecords and 
> converts them to instances of our case classes, rather than relying on 
> ParquetIO to deserialize into the case class via reflection + implementing 
> the IndexedRecord interface.
> The ParquetIO.Read / ParquetIO.ReadFiles Builders currently support a 
> filepattern + schema / schema arguments respectively. When using the Read / 
> ReadFiles Builders with these arguments, the underlying AvroParquetReader 
> object that gets created in the ParquetIO.ReadFiles.ReadFn method defaults to 
> utilizing an AvroReadSupport instance whose GenericData model gets set to 
> SpecificData. We'd like to have the underlying AvroReadSupport utilize 
> the GenericData model, but there's currently no way to force this to happen 
> via the existing ParquetIO Read / ReadFiles builders. 
> I'd like to extend the ParquetIO Read / ReadFiles builders to support a new 
> method allowing users to define a GenericData model, which will then be 
> passed into the AvroParquetReader builder. I've tested and validated that 
> this method allows ParquetIO to generate GenericRecord instances without 
> requiring that the users' classes can be reflectively instantiated and 
> initialized via the IndexedRecord interface.
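
The shape of the proposed change, an optional data model on the read builder that falls back to the current default when unset, can be sketched as follows. This is plain Java with hypothetical stand-ins (DataModel, ReadSpec, withDataModel, effectiveModel); it does not use the real ParquetIO, AvroParquetReader, or GenericData classes, and the actual method name and signature are whatever the pull request defines.

```java
import java.util.Objects;

public class ReadBuilderSketch {
    // Hypothetical stand-in for the available Avro data models.
    enum DataModel { SPECIFIC, GENERIC }

    // Hypothetical immutable read spec, loosely mirroring a Read/ReadFiles builder.
    static final class ReadSpec {
        private final String filePattern;
        private final DataModel model; // null means "not set by the user"

        private ReadSpec(String filePattern, DataModel model) {
            this.filePattern = filePattern;
            this.model = model;
        }

        static ReadSpec from(String filePattern) {
            return new ReadSpec(Objects.requireNonNull(filePattern), null);
        }

        // New optional setting: returns a copy with an explicit data model.
        ReadSpec withDataModel(DataModel model) {
            return new ReadSpec(filePattern, Objects.requireNonNull(model));
        }

        // The reader would use the explicit model if present, else the existing default.
        DataModel effectiveModel() {
            return model != null ? model : DataModel.SPECIFIC;
        }
    }

    public static void main(String[] args) {
        ReadSpec byDefault = ReadSpec.from("input/*.parquet");
        ReadSpec generic = byDefault.withDataModel(DataModel.GENERIC);
        System.out.println(byDefault.effectiveModel()); // SPECIFIC
        System.out.println(generic.effectiveModel());   // GENERIC
    }
}
```

Keeping the model nullable preserves existing behaviour for callers who never set it.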



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
