[
https://issues.apache.org/jira/browse/BEAM-5807?focusedWorklogId=157200&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-157200
]
ASF GitHub Bot logged work on BEAM-5807:
----------------------------------------
Author: ASF GitHub Bot
Created on: 22/Oct/18 21:11
Start Date: 22/Oct/18 21:11
Worklog Time Spent: 10m
Work Description: kanterov edited a comment on issue #6777: [BEAM-5807]
Conversion from AVRO records to rows
URL: https://github.com/apache/beam/pull/6777#issuecomment-431968627
@akedin Thanks for the review. I would be happy to merge this sooner rather
than later.
I see this as one of the building blocks that can be used to implement
different higher-level transforms. Here are a few examples I have in mind.
### Case 1. AvroIO and POJO
As I understand it, this is what you mentioned. Given an AVRO schema and a
POJO class, we want to read a `PCollection<POJO>` using `AvroIO`.
How it works:
1. we derive a Beam `Schema` from the AVRO schema
2. we get the `Schema` for the POJO from a `SchemaProvider`
3. there is a `PTransform` that converts rows between schemas if needed
4. `PTransform#validate` checks whether the conversion is possible
No codegen is required, as opposed to using avro-compiler.
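A minimal sketch of these steps, assuming the `AvroUtils` conversions from
this PR (`toBeamSchema`, `toBeamRowStrict`) together with the existing
`Convert` transform; the `User` POJO, its AVRO schema, and the file path are
made up for illustration:
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.schemas.transforms.Convert;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class AvroToPojoExample {

  // Hypothetical POJO; its Schema comes from the registered SchemaProvider.
  @DefaultSchema(JavaFieldSchema.class)
  public static class User {
    public String name;
    public int age;
  }

  // Hypothetical AVRO schema matching the POJO.
  private static final String USER_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Step 1: derive a Beam Schema from the AVRO schema.
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(USER_SCHEMA_JSON);
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

    // Read GenericRecords and convert them to Rows with the derived schema.
    PCollection<Row> rows =
        p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://bucket/users-*.avro"))
            .apply(
                MapElements.into(TypeDescriptor.of(Row.class))
                    .via((GenericRecord r) -> AvroUtils.toBeamRowStrict(r, beamSchema)))
            .setRowSchema(beamSchema);

    // Steps 2-4: Convert matches the Row schema against the POJO's schema
    // and fails at pipeline construction time if conversion is not possible.
    PCollection<User> users = rows.apply(Convert.to(User.class));

    p.run().waitUntilFinish();
  }
}
```
The schema matching happens when the pipeline is constructed, so an
incompatible POJO fails fast instead of mid-job.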
### Case 2. AvroIO and Row
The difference from case 1 is that we don't have a POJO. The AVRO schema is
dynamic, and we don't know it at compile time; we get it at runtime from an
external schema registry service, or from AVRO file headers on a filesystem.
At the same time, we want to run SQL, and the query itself can also be
dynamic, or be derived from the schema. For instance, you can imagine a
generic data-profiling pipeline implemented this way.
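A sketch of case 2; `fetchSchemaJson` stands in for a hypothetical
schema-registry lookup, and the `AvroUtils` conversions are again the ones
from this PR:
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class DynamicAvroSqlExample {

  // Hypothetical: fetch the writer schema from an external schema registry.
  static String fetchSchemaJson(String subject) {
    throw new UnsupportedOperationException("registry lookup goes here");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // The AVRO schema is only known at runtime.
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(fetchSchemaJson("events"));
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

    PCollection<Row> rows =
        p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://bucket/events-*.avro"))
            .apply(
                MapElements.into(TypeDescriptor.of(Row.class))
                    .via((GenericRecord r) -> AvroUtils.toBeamRowStrict(r, beamSchema)))
            .setRowSchema(beamSchema);

    // The query can be dynamic too, e.g. generated from beamSchema to
    // profile every column.
    PCollection<Row> profile =
        rows.apply(SqlTransform.query("SELECT COUNT(*) AS n FROM PCOLLECTION"));

    p.run().waitUntilFinish();
  }
}
```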
### Case 3. BigQuery and TableProvider
We get the table schema from BigQuery, and once we know it, this works
similarly to case 2. It is something like external tables in Hive.
We don't need to register tables in advance as is done today, because we can
list the tables in a GCP project and derive schemas for them. The same
approach can be used for AVRO if there is a metadata catalog like the Hive
metastore.
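From the user's side, case 3 could look like the sketch below, wired in
through Beam SQL's `TableProvider` mechanism. The provider itself is
hypothetical (it would list tables in a GCP project and derive each schema on
demand; nothing like that exists today):
```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.extensions.sql.meta.provider.TableProvider;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class ExternalTableExample {

  // Hypothetical: a provider that lists tables in a GCP project and derives
  // their schemas when the query is planned.
  static TableProvider dynamicBigQueryProvider(String project) {
    throw new UnsupportedOperationException("not implemented in Beam today");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // No CREATE TABLE / registration step: the provider resolves
    // my_dataset.events and its schema at query-planning time.
    PCollection<Row> result =
        p.apply(
            SqlTransform.query(
                    "SELECT user_id, COUNT(*) AS n "
                        + "FROM bigquery.my_dataset.events GROUP BY user_id")
                .withTableProvider("bigquery", dynamicBigQueryProvider("my-gcp-project")));

    p.run().waitUntilFinish();
  }
}
```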
### Reading directly to Row
I was trying to read directly to `Row` instead of `GenericRecord`. It's
possible to get up to a 3x performance improvement by doing runtime codegen
using ByteBuddy or similar. There is an
[article](https://techblog.rtbhouse.com/2017/04/18/fast-avro/) describing a
similar approach.
This approach also enables projection and filter pushdown to a data source;
recently there was another
[article](https://dawn.cs.stanford.edu/2018/08/07/sparser/) about that. It
could be done by injecting a custom `DatumReader` into `AvroIO`.
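To make the `DatumReader` injection concrete, here is a plain-Avro sketch (no
Beam) of the projection mechanic: a reader schema with a subset of the
writer's fields makes Avro's schema resolution skip the omitted fields during
decoding. A custom `DatumReader` inside `AvroIO` could exploit the same hook;
the file name and schema here are made up:
```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectionPushdownExample {

  // Reader schema projecting a single field out of the writer's record.
  private static final Schema PROJECTION = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws IOException {
    // Writer schema is null here; DataFileReader fills it in from the
    // file header, and Avro resolves it against the reader schema.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(/* writer= */ null, /* reader= */ PROJECTION);
    try (DataFileReader<GenericRecord> fileReader =
        new DataFileReader<>(new File("users.avro"), datumReader)) {
      for (GenericRecord record : fileReader) {
        System.out.println(record.get("name")); // only "name" was decoded
      }
    }
  }
}
```
Inside `AvroIO`, the same effect would come from letting the user (or a SQL
planner doing pushdown) supply the reader schema or a codegen'd `DatumReader`.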
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 157200)
Time Spent: 2h 20m (was: 2h 10m)
> Add AvroIO.readRows
> -------------------
>
> Key: BEAM-5807
> URL: https://issues.apache.org/jira/browse/BEAM-5807
> Project: Beam
> Issue Type: Improvement
> Components: dsl-sql
> Reporter: Gleb Kanterov
> Assignee: Gleb Kanterov
> Priority: Major
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> # Motivation
> At the moment the only way to read AVRO is through code generation with
> avro-compiler and JavaBeanSchema. This makes it impossible to write
> transforms that can work with dynamic schemas. AVRO has a generic data type
> called GenericRecord, and reading it is implemented in
> AvroIO.readGenericRecords. There is code to convert GenericRecord to Row
> shipped as a part of BigQueryIO; however, it doesn't support all types or
> nested records.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)