[
https://issues.apache.org/jira/browse/BEAM-5807?focusedWorklogId=157200&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-157200
]
ASF GitHub Bot logged work on BEAM-5807:
----------------------------------------
Author: ASF GitHub Bot
Created on: 22/Oct/18 21:11
Start Date: 22/Oct/18 21:11
Worklog Time Spent: 10m
Work Description: kanterov edited a comment on issue #6777: [BEAM-5807]
Conversion from AVRO records to rows
URL: https://github.com/apache/beam/pull/6777#issuecomment-431968627
@akedin Thanks for the review. I would be happy to merge this sooner rather
than later.
I see this as one of the building blocks that can be used to implement
different higher-level transforms. Here are a few examples I have in mind.
### Case 1. AvroIO and POJO
As I understand it, this is what you mentioned. Given an AVRO schema and a
POJO class, we want to read a `PCollection<POJO>` using `AvroIO`.
How it works:
1. we derive a Beam `Schema` from the AVRO schema
2. we get the `Schema` for the POJO from a `SchemaProvider`
3. there is a `PTransform` that converts rows between schemas if needed
4. `PTransform#validate` checks whether the conversion is possible
No codegen is required, as opposed to using avro-compiler.
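A minimal sketch of these steps, assuming the `AvroUtils` conversions from
this PR (`toBeamSchema`, `toBeamRowStrict`) together with the existing
`Convert` transform; the `User` POJO, its AVRO schema, and the file path are
made up for illustration:
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;
import org.apache.beam.sdk.schemas.transforms.Convert;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class AvroToPojoExample {

  // Hypothetical POJO; its Schema comes from the registered SchemaProvider.
  @DefaultSchema(JavaFieldSchema.class)
  public static class User {
    public String name;
    public int age;
  }

  // Hypothetical AVRO schema matching the POJO.
  private static final String USER_SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Step 1: derive a Beam Schema from the AVRO schema.
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(USER_SCHEMA_JSON);
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

    // Read GenericRecords and convert them to Rows with the derived schema.
    PCollection<Row> rows =
        p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://bucket/users-*.avro"))
            .apply(
                MapElements.into(TypeDescriptor.of(Row.class))
                    .via((GenericRecord r) -> AvroUtils.toBeamRowStrict(r, beamSchema)))
            .setRowSchema(beamSchema);

    // Steps 2-4: Convert matches the Row schema against the POJO's schema
    // and fails at pipeline construction time if conversion is not possible.
    PCollection<User> users = rows.apply(Convert.to(User.class));

    p.run().waitUntilFinish();
  }
}
```
The schema matching happens when the pipeline is constructed, so an
incompatible POJO fails fast instead of mid-job.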
### Case 2. AvroIO and Row
The difference from case 1 is that we don't have a POJO. The AVRO schema is
dynamic, and we don't know it at compile time; we get it at runtime from an
external schema registry service, or from AVRO file headers on a filesystem.
At the same time, we want to run SQL, and the query itself can also be
dynamic, or be derived from the schema. For instance, you can imagine a
generic data-profiling pipeline implemented this way.
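A sketch of case 2; `fetchSchemaJson` stands in for a hypothetical
schema-registry lookup, and the `AvroUtils` conversions are again the ones
from this PR:
```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.AvroUtils;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public class DynamicAvroSqlExample {

  // Hypothetical: fetch the writer schema from an external schema registry.
  static String fetchSchemaJson(String subject) {
    throw new UnsupportedOperationException("registry lookup goes here");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // The AVRO schema is only known at runtime.
    org.apache.avro.Schema avroSchema =
        new org.apache.avro.Schema.Parser().parse(fetchSchemaJson("events"));
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

    PCollection<Row> rows =
        p.apply(AvroIO.readGenericRecords(avroSchema).from("gs://bucket/events-*.avro"))
            .apply(
                MapElements.into(TypeDescriptor.of(Row.class))
                    .via((GenericRecord r) -> AvroUtils.toBeamRowStrict(r, beamSchema)))
            .setRowSchema(beamSchema);

    // The query can be dynamic too, e.g. generated from beamSchema to
    // profile every column.
    PCollection<Row> profile =
        rows.apply(SqlTransform.query("SELECT COUNT(*) AS n FROM PCOLLECTION"));

    p.run().waitUntilFinish();
  }
}
```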
### Case 3. BigQuery and TableProvider
We get the table schema from BigQuery, and once we know it, this works
similarly to case 2. It is something like external tables in Hive.
We don't need to register tables in advance as is done today, because we can
list the tables in a GCP project and derive schemas for them. The same
approach can be used for AVRO if there is a metadata catalog like the Hive
metastore.
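From the user's side, case 3 could look like the sketch below, wired in
through Beam SQL's `TableProvider` mechanism. The provider itself is
hypothetical (it would list tables in a GCP project and derive each schema on
demand; nothing like that exists today):
```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.extensions.sql.meta.provider.TableProvider;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class ExternalTableExample {

  // Hypothetical: a provider that lists tables in a GCP project and derives
  // their schemas when the query is planned.
  static TableProvider dynamicBigQueryProvider(String project) {
    throw new UnsupportedOperationException("not implemented in Beam today");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // No CREATE TABLE / registration step: the provider resolves
    // my_dataset.events and its schema at query-planning time.
    PCollection<Row> result =
        p.apply(
            SqlTransform.query(
                    "SELECT user_id, COUNT(*) AS n "
                        + "FROM bigquery.my_dataset.events GROUP BY user_id")
                .withTableProvider("bigquery", dynamicBigQueryProvider("my-gcp-project")));

    p.run().waitUntilFinish();
  }
}
```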
### Reading directly to Row
I was trying to read directly to `Row` instead of `GenericRecord`. It's
possible to get up to a 3x performance improvement by doing runtime codegen
using ByteBuddy or similar. There is an
[article](https://techblog.rtbhouse.com/2017/04/18/fast-avro/) describing a
similar approach.
This approach also enables projection and filter pushdown to a data source;
recently there was another
[article](https://dawn.cs.stanford.edu/2018/08/07/sparser/) about that. It
could be done by injecting a custom `DatumReader` into `AvroIO`.
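To make the `DatumReader` injection concrete, here is a plain-Avro sketch (no
Beam) of the projection mechanic: a reader schema with a subset of the
writer's fields makes Avro's schema resolution skip the omitted fields during
decoding. A custom `DatumReader` inside `AvroIO` could exploit the same hook;
the file name and schema here are made up:
```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ProjectionPushdownExample {

  // Reader schema projecting a single field out of the writer's record.
  private static final Schema PROJECTION = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws IOException {
    // Writer schema is null here; DataFileReader fills it in from the
    // file header, and Avro resolves it against the reader schema.
    GenericDatumReader<GenericRecord> datumReader =
        new GenericDatumReader<>(/* writer= */ null, /* reader= */ PROJECTION);
    try (DataFileReader<GenericRecord> fileReader =
        new DataFileReader<>(new File("users.avro"), datumReader)) {
      for (GenericRecord record : fileReader) {
        System.out.println(record.get("name")); // only "name" was decoded
      }
    }
  }
}
```
Inside `AvroIO`, the same effect would come from letting the user (or a SQL
planner doing pushdown) supply the reader schema or a codegen'd `DatumReader`.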
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 157200)
Time Spent: 2h 20m (was: 2h 10m)
> Add AvroIO.readRows
> -------------------
>
> Key: BEAM-5807
> URL: https://issues.apache.org/jira/browse/BEAM-5807
> Project: Beam
> Issue Type: Improvement
> Components: dsl-sql
> Reporter: Gleb Kanterov
> Assignee: Gleb Kanterov
> Priority: Major
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> # Motivation
> At the moment the only way to read AVRO is through code generation with
> avro-compiler and JavaBeanSchema. This makes it impossible to write
> transforms that can work with dynamic schemas. AVRO has a generic data type
> called GenericRecord, and reading it is implemented in
> AvroIO.readGenericRecords. There is code to convert GenericRecord to Row
> shipped as a part of BigQueryIO; however, it doesn't support all types or
> nested records.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)