[
https://issues.apache.org/jira/browse/HIVE-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated HIVE-19207:
------------------------------
Target Version/s: 3.1.0, 3.0.0 (was: 3.0.0, 3.1.0)
Assignee: Alan Gates (was: Prasanth Jayachandran)
Status: Patch Available (was: Open)
Here is an initial pass at adding Avro support in streaming.
I modified the RecordWriter interface to be parameterized by the type of
records that the user is writing. Previously it seemed to assume that all
records could be turned into byte[] and handled as such. This is not convenient
when you already have structured data like Avro. I tried to do it in a way that
did not break backward compatibility with existing RecordWriter implementations.
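A type-parameterized interface along these lines might look like the following. This is a hypothetical sketch to illustrate the idea, not the interface from the attached patch; the names TypedRecordWriter and CollectingWriter are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RecordWriter parameterized by the record type T,
// instead of assuming every record can be reduced to byte[]. The actual
// interface in HIVE-19207.patch may differ.
interface TypedRecordWriter<T> {
    void write(T record) throws Exception;
    void flush() throws Exception;
}

// A trivial in-memory implementation, only to show how the parameter is used.
class CollectingWriter<T> implements TypedRecordWriter<T> {
    final List<T> written = new ArrayList<>();
    public void write(T record) { written.add(record); }
    public void flush() { /* nothing buffered in this toy implementation */ }
}
```

An existing byte[]-based writer would simply be a TypedRecordWriter<byte[]>, which is how backward compatibility could be preserved.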
I've added two RecordWriter implementations, a StrictAvroWriter and a
MappingAvroWriter.
StrictAvroWriter assumes that the Hive table and Avro records exactly match in
schema (or at least close enough that the type conversion can be done). It also
assumes that the Avro schema passed to it exactly matches every Avro record in
the stream.
MappingAvroWriter takes a map of Hive column names to Avro paths. The avro path
can be a simple column name, or a path through an Avro complex type. So the
Hive column 'zipcode' could be mapped to an Avro column 'zipcode' or to an Avro
record with a zipcode field (address.zipcode) or to an Avro map with a zipcode
key (address[zipcode]). Again the system assumes the types are close enough
that Hive can do type conversion if necessary. In this case the Avro schema
passed to the writer does not have to exactly match every record in the stream,
but it must be usable to decode the referenced Avro columns for every record
in the stream.
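The three path forms described above (simple name, dotted record field, bracketed map key) could be tokenized roughly as follows. This is an illustrative sketch of the syntax only; the AvroPath class and its steps method are hypothetical, not code from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative tokenizer for the Avro path syntax described above:
//   "zipcode"          -> [zipcode]
//   "address.zipcode"  -> [address, zipcode]  (field of an Avro record)
//   "address[zipcode]" -> [address, zipcode]  (key of an Avro map)
// Hypothetical sketch; not the parser used in HIVE-19207.patch.
class AvroPath {
    static List<String> steps(String path) {
        List<String> steps = new ArrayList<>();
        // Treat '.', '[' and ']' as step separators.
        for (String part : path.split("[.\\[\\]]")) {
            if (!part.isEmpty()) steps.add(part);
        }
        return steps;
    }
}
```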
Both writers support all Avro types except Null as a top level object. Avro
unions created just to allow a null value are "read through" to the non-null
type, and that type is used. For example, an Avro nullable string (a union of
null and string) will become a string in Hive.
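The "read through" of null-only unions can be pictured like this. A sketch under the assumption that a union is represented as the list of its branch type names; the class and method names are hypothetical, not the patch's actual representation:

```java
import java.util.List;
import java.util.Optional;

// Sketch: given an Avro union's branch type names, "read through" a union
// that exists only to permit null, e.g. ["null", "string"] -> "string".
// Unions with more than one non-null branch are not read through.
class UnionReadThrough {
    static Optional<String> nonNullBranch(List<String> branches) {
        String found = null;
        for (String t : branches) {
            if (t.equals("null")) continue;
            if (found != null) return Optional.empty(); // multiple non-null branches
            found = t;
        }
        return Optional.ofNullable(found);
    }
}
```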
For both writers I did not use the existing AvroSerDe because it assumes that
every Avro record has a schema encoded with it. That is not how I expect users
to stream their data in practice. I did try to follow the same type
conversions as the AvroSerDe.
> Support avro record writer for streaming ingest
> -----------------------------------------------
>
> Key: HIVE-19207
> URL: https://issues.apache.org/jira/browse/HIVE-19207
> Project: Hive
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 3.1.0, 3.0.0
> Reporter: Prasanth Jayachandran
> Assignee: Alan Gates
> Priority: Major
> Attachments: HIVE-19207.patch
>
>
> Add support for Avro record writer in streaming ingest.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)