Mithun Radhakrishnan created HIVE-14789:
-------------------------------------------
Summary: Avro Table-reads bork when using SerDe-generated
table-schema.
Key: HIVE-14789
URL: https://issues.apache.org/jira/browse/HIVE-14789
Project: Hive
Issue Type: Bug
Components: Serializers/Deserializers
Affects Versions: 2.0.1, 1.2.1
Reporter: Mithun Radhakrishnan
AvroSerDe allows one to skip the table-columns in a table-definition when
creating a table, as long as the TBLPROPERTIES includes a valid
{{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are inferred
from processing the Avro schema file/literal.
The problem is that the inferred schema might not be congruent with the actual
schema in the Avro schema file/literal. Consider the following table definition:
{code:sql}
CREATE TABLE avro_schema_break_1
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Messages",
  "namespace": "net.myth",
  "fields": [
    {
      "name": "header",
      "type": [
        "null",
        {
          "type": "record",
          "name": "HeaderInfo",
          "fields": [
            {
              "name": "inferred_event_type",
              "type": [
                "null",
                "string"
              ],
              "default": null
            },
            {
              "name": "event_type",
              "type": [
                "null",
                "string"
              ],
              "default": null
            },
            {
              "name": "event_version",
              "type": [
                "null",
                "string"
              ],
              "default": null
            }
          ]
        }
      ]
    },
    {
      "name": "messages",
      "type": {
        "type": "array",
        "items": {
          "name": "MessageInfo",
          "type": "record",
          "fields": [
            {
              "name": "message_id",
              "type": [
                "null",
                "string"
              ],
              "doc": "Message-ID"
            },
            {
              "name": "received_date",
              "type": [
                "null",
                "long"
              ],
              "doc": "Received Date"
            },
            {
              "name": "sent_date",
              "type": [
                "null",
                "long"
              ]
            },
            {
              "name": "from_name",
              "type": [
                "null",
                "string"
              ]
            },
            {
              "name": "flags",
              "type": [
                "null",
                {
                  "type": "record",
                  "name": "Flags",
                  "fields": [
                    {
                      "name": "is_seen",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    },
                    {
                      "name": "is_read",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    },
                    {
                      "name": "is_flagged",
                      "type": [
                        "null",
                        "boolean"
                      ],
                      "default": null
                    }
                  ]
                }
              ],
              "default": null
            }
          ]
        }
      }
    }
  ]
}');
{code}
This produces a table with the following schema:
{noformat}
2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 {
struct<inferred_event_type:string,event_type:string,event_version:string> header,
list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
{noformat}
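To make the mismatch easier to see, here is a minimal Python sketch of the kind of flattening that appears to happen during inference. It is not the AvroSerDe implementation: the function {{inferred_type}} and the {{PRIMITIVES}} table are hypothetical, and the type names are simply the ones visible in the logged DDL. It shows that a {{["null", T]}} union collapses to plain {{T}} and that the record name ({{HeaderInfo}}) has no place in the resulting type string.

```python
import json

# Hypothetical mapping, limited to the primitive types seen in the logged DDL.
PRIMITIVES = {"string": "string", "long": "i64", "boolean": "bool"}


def inferred_type(avro_type):
    """Return a DDL-style type string for an Avro type (illustrative only)."""
    if isinstance(avro_type, str):
        return PRIMITIVES[avro_type]
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        if len(branches) == 1:
            # ["null", T] collapses to plain T: the union wrapper is dropped.
            return inferred_type(branches[0])
        return "uniontype<" + ",".join(inferred_type(t) for t in branches) + ">"
    if avro_type["type"] == "record":
        # The record's "name" (e.g. HeaderInfo) is not carried into the result.
        fields = ",".join(
            f["name"] + ":" + inferred_type(f["type"]) for f in avro_type["fields"]
        )
        return "struct<" + fields + ">"
    if avro_type["type"] == "array":
        return "list<" + inferred_type(avro_type["items"]) + ">"
    raise ValueError("unsupported Avro type: %r" % (avro_type,))


# The "header" field from the avro.schema.literal in the table definition.
header = json.loads("""
{"name": "header",
 "type": ["null", {"type": "record", "name": "HeaderInfo", "fields": [
   {"name": "inferred_event_type", "type": ["null", "string"], "default": null},
   {"name": "event_type", "type": ["null", "string"], "default": null},
   {"name": "event_version", "type": ["null", "string"], "default": null}]}]}
""")

header_ddl = inferred_type(header["type"])
print(header_ddl)
# struct<inferred_event_type:string,event_type:string,event_version:string>
```

The printed string matches the {{header}} column in the logged DDL. Nothing in it records that the field was a nullable union or that the record was named {{net.myth.HeaderInfo}}.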
Data written to this table by Pig's {{AvroStorage}}, using the Avro schema from
{{avro.schema.literal}}, cannot be read by Hive using the generated table
schema. This is the exception one sees:
{noformat}
java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
...
{noformat}
The only way to read this table is to supply the original
{{avro.schema.literal}} or {{avro.schema.url}}. This has implications for
systems where data is produced outside Hive. It also affects
table replication using Falcon/GDM, in that the schema
file/literal needs to be replicated along with the table.