[
https://issues.apache.org/jira/browse/HIVE-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Maughan updated HIVE-15316:
---------------------------------
Description:
There's an issue when querying a table that has been created as Avro via CTAS
when the target struct is at least 2 struct-levels deep. It can be replicated
with the following steps:
{code}
CREATE TABLE a
STORED AS AVRO
AS
SELECT named_struct('c', named_struct('d', 1)) as b;
SELECT b FROM a;
org.apache.avro.AvroTypeException: Found default.record_0, expecting union
{code}
The reason for this is that during table creation, the Avro schema is generated
from the Hive columns in {{AvroSerDe}} and then passed through the Avro Schema
Parser: {{new Schema.Parser().parse(schema.toString())}}. For the above
example, this creates the below schema in the Avro file. Note that the lowest
level struct, {{record_0}} has {{"namespace": "default"}}.
{code}
{
"type": "record",
"name": "a",
"namespace": "default",
"fields": [
{
"name": "b",
"type": [
"null",
{
"type": "record",
"name": "record_1",
"namespace": "",
"doc": "struct<c:struct<d:int>>",
"fields": [
{
"name": "c",
"type": [
"null",
{
"type": "record",
"name": "record_0",
"namespace": "default",
"doc": "struct<d:int>",
"fields": [
{
"name": "d",
"type": [ "null", "int" ],
"doc": "int",
"default": null
}
]
}
],
"doc": "struct<d:int>",
"default": null
}
]
}
],
"default": null
}
]
}
{code}
On a subsequent select query, the Avro schema is again generated from the Hive
columns. However, this time it is not passed through the Avro Schema Parser and
the {{namespace}} attribute is not present in {{record_0}}. The actual Error
message _"Found default.record_0, expecting union"_ is slightly misleading.
Although it is a expecting a union, it is specifically expecting a null or a
record named {{record_0}} but it finds {{default.record_0}}.
I believe this is a bug in Avro. I'm not sure whether correct behaviour is to
cascade the namespace down or not but it is definitely an inconsistency between
creating a schema via the builders and parser. I've created
[AVRO-1965|https://issues.apache.org/jira/browse/AVRO-1965] for this. However,
I believe that defensively passing the schema through the Avro Schema Parser on
a select query would fix this issue in Hive without an Avro fix and version
bump in Hive.
was:
There's an issue when querying a table that has been created as Avro via CTAS
when the target struct is at least 2 struct-levels deep. It can be replicated
with the following steps:
{code}
CREATE TABLE a
STORED AS AVRO
AS
SELECT named_struct('c', named_struct('d', 1)) as b;
SELECT b FROM a;
org.apache.avro.AvroTypeException: Found default.record_0, expecting union
{code}
The reason for this is that during table creation, the Avro schema is generated
from the Hive columns in {{AvroSerDe}} and then passed through the Avro Schema
Parser: {{new Schema.Parser().parse(schema.toString())}}. For the above
example, this creates the below schema in the Avro file. Note that the lowest
level struct, {{record_0}} has {{"namespace": "default"}}.
{code}
{
"type": "record",
"name": "a",
"namespace": "default",
"fields": [
{
"name": "b",
"type": [
"null",
{
"type": "record",
"name": "record_1",
"namespace": "",
"doc": "struct<c:struct<d:int>>",
"fields": [
{
"name": "c",
"type": [
"null",
{
"type": "record",
"name": "record_0",
"namespace": "default",
"doc": "struct<d:int>",
"fields": [
{
"name": "d",
"type": [ "null", "int" ],
"doc": "int",
"default": null
}
]
}
],
"doc": "struct<d:int>",
"default": null
}
]
}
],
"default": null
}
]
}
{code}
On a subsequent select query, the Avro schema is again generated from the Hive
columns. However, this time it is not passed through the Avro Schema Parser and
the {{namespace}} attribute is not present in {{record_0}}. The actual Error
message _"Found default.record_0, expecting union"_ is slightly misleading.
Although it is a expected a union, it is specifically expected a null or a
record named {{record_0}} but it finds {{default.record_0}}.
I believe this is a bug in Avro. I'm not sure whether correct behaviour is to
cascade the namespace down or not but it is definitely an inconsistency between
creating a schema via the builders and parser. I've created
[AVRO-1965|https://issues.apache.org/jira/browse/AVRO-1965] for this. However,
I believe that defensively passing the schema through the Avro Schema Parser on
a select query would fix this issue in Hive without an Avro fix and version
bump in Hive.
> CTAS STORED AS AVRO: AvroTypeException Found default.record_0, expecting union
> ------------------------------------------------------------------------------
>
> Key: HIVE-15316
> URL: https://issues.apache.org/jira/browse/HIVE-15316
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.1.0
> Reporter: David Maughan
> Priority: Minor
>
> There's an issue when querying a table that has been created as Avro via CTAS
> when the target struct is at least 2 struct-levels deep. It can be replicated
> with the following steps:
> {code}
> CREATE TABLE a
> STORED AS AVRO
> AS
> SELECT named_struct('c', named_struct('d', 1)) as b;
> SELECT b FROM a;
> org.apache.avro.AvroTypeException: Found default.record_0, expecting union
> {code}
> The reason for this is that during table creation, the Avro schema is
> generated from the Hive columns in {{AvroSerDe}} and then passed through the
> Avro Schema Parser: {{new Schema.Parser().parse(schema.toString())}}. For the
> above example, this creates the below schema in the Avro file. Note that the
> lowest level struct, {{record_0}} has {{"namespace": "default"}}.
> {code}
> {
> "type": "record",
> "name": "a",
> "namespace": "default",
> "fields": [
> {
> "name": "b",
> "type": [
> "null",
> {
> "type": "record",
> "name": "record_1",
> "namespace": "",
> "doc": "struct<c:struct<d:int>>",
> "fields": [
> {
> "name": "c",
> "type": [
> "null",
> {
> "type": "record",
> "name": "record_0",
> "namespace": "default",
> "doc": "struct<d:int>",
> "fields": [
> {
> "name": "d",
> "type": [ "null", "int" ],
> "doc": "int",
> "default": null
> }
> ]
> }
> ],
> "doc": "struct<d:int>",
> "default": null
> }
> ]
> }
> ],
> "default": null
> }
> ]
> }
> {code}
> On a subsequent select query, the Avro schema is again generated from the
> Hive columns. However, this time it is not passed through the Avro Schema
> Parser and the {{namespace}} attribute is not present in {{record_0}}. The
> actual Error message _"Found default.record_0, expecting union"_ is slightly
> misleading. Although it is a expecting a union, it is specifically expecting
> a null or a record named {{record_0}} but it finds {{default.record_0}}.
> I believe this is a bug in Avro. I'm not sure whether correct behaviour is to
> cascade the namespace down or not but it is definitely an inconsistency
> between creating a schema via the builders and parser. I've created
> [AVRO-1965|https://issues.apache.org/jira/browse/AVRO-1965] for this.
> However, I believe that defensively passing the schema through the Avro
> Schema Parser on a select query would fix this issue in Hive without an Avro
> fix and version bump in Hive.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)