[
https://issues.apache.org/jira/browse/HIVE-19207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alan Gates updated HIVE-19207:
------------------------------
Target Version/s: 3.1.0, 3.0.0 (was: 3.0.0, 3.1.0)
Assignee: Alan Gates (was: Prasanth Jayachandran)
Status: Patch Available (was: Open)
Here is an initial pass at adding Avro support in streaming.
I modified the RecordWriter interface to be parameterized by the type of
records that the user is writing. Previously it seemed to assume that all
records could be turned into byte[] and handled as such. This is not convenient
when you already have structured data like Avro. I tried to do it in a way that
did not break backward compatibility with existing RecordWriter implementations.
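A type-parameterized interface along these lines might look like the following. This is a hypothetical sketch to illustrate the idea, not the interface from the attached patch; the names TypedRecordWriter and CollectingWriter are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a RecordWriter parameterized by the record type T,
// instead of assuming every record can be reduced to byte[]. The actual
// interface in HIVE-19207.patch may differ.
interface TypedRecordWriter<T> {
    void write(T record) throws Exception;
    void flush() throws Exception;
}

// A trivial in-memory implementation, only to show how the parameter is used.
class CollectingWriter<T> implements TypedRecordWriter<T> {
    final List<T> written = new ArrayList<>();
    public void write(T record) { written.add(record); }
    public void flush() { /* nothing buffered in this toy implementation */ }
}
```

An existing byte[]-based writer would simply be a TypedRecordWriter<byte[]>, which is how backward compatibility could be preserved.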
I've added two RecordWriter implementations, a StrictAvroWriter and a
MappingAvroWriter.
StrictAvroWriter assumes that the Hive table and Avro records exactly match in
schema (or at least close enough that the type conversion can be done). It also
assumes that the Avro schema passed to it exactly matches every Avro record in
the stream.
MappingAvroWriter takes a map of Hive column names to Avro paths. The avro path
can be a simple column name, or a path through an Avro complex type. So the
Hive column 'zipcode' could be mapped to an Avro column 'zipcode' or to an Avro
record with a zipcode field (address.zipcode) or to an Avro map with a zipcode
key (address[zipcode]). Again the system assumes the types are close enough
that Hive can do type conversion if necessary. In this case the Avro schema
passed to the writer does not have to exactly match every record in the stream,
but it must be usable to decode the referenced Avro columns for every record
in the stream.
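The three path forms described above (simple name, dotted record field, bracketed map key) could be tokenized roughly as follows. This is an illustrative sketch of the syntax only; the AvroPath class and its steps method are hypothetical, not code from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative tokenizer for the Avro path syntax described above:
//   "zipcode"          -> [zipcode]
//   "address.zipcode"  -> [address, zipcode]  (field of an Avro record)
//   "address[zipcode]" -> [address, zipcode]  (key of an Avro map)
// Hypothetical sketch; not the parser used in HIVE-19207.patch.
class AvroPath {
    static List<String> steps(String path) {
        List<String> steps = new ArrayList<>();
        // Treat '.', '[' and ']' as step separators.
        for (String part : path.split("[.\\[\\]]")) {
            if (!part.isEmpty()) steps.add(part);
        }
        return steps;
    }
}
```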
Both writers support all Avro types except Null as a top level object. Avro
unions created just to allow a null value are "read through" to the non-null
type, and that type is used. For example, an Avro nullable string (a union of
null and string) will become a string in Hive.
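The "read through" of null-only unions can be pictured like this. A sketch under the assumption that a union is represented as the list of its branch type names; the class and method names are hypothetical, not the patch's actual representation:

```java
import java.util.List;
import java.util.Optional;

// Sketch: given an Avro union's branch type names, "read through" a union
// that exists only to permit null, e.g. ["null", "string"] -> "string".
// Unions with more than one non-null branch are not read through.
class UnionReadThrough {
    static Optional<String> nonNullBranch(List<String> branches) {
        String found = null;
        for (String t : branches) {
            if (t.equals("null")) continue;
            if (found != null) return Optional.empty(); // multiple non-null branches
            found = t;
        }
        return Optional.ofNullable(found);
    }
}
```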
For both writers I did not use the existing AvroSerDe because it assumes that
every Avro record has a schema encoded with it. That is not how I expect users
to stream their data in practice. I did try to follow the same type
conversions as the AvroSerDe.
> Support avro record writer for streaming ingest
> -----------------------------------------------
>
> Key: HIVE-19207
> URL: https://issues.apache.org/jira/browse/HIVE-19207
> Project: Hive
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 3.1.0, 3.0.0
> Reporter: Prasanth Jayachandran
> Assignee: Alan Gates
> Priority: Major
> Attachments: HIVE-19207.patch
>
>
> Add support for Avro record writer in streaming ingest.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)