[
https://issues.apache.org/jira/browse/HIVE-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
George Wood updated HIVE-24647:
-------------------------------
Description:
A scenario has been discovered that seems to result in attempts to vectorize
and run non-vectorizable queries. The conditions of this scenario are as
follows:A scenario has been discovered that seems to result in attempts to
vectorize and run non-vectorizable queries. The conditions of this scenario are
as follows:
# 1. The query {{SELECT}}s non-primitive type columns. I have tested this with
{{ARRAY}} and {{RECORD}} types.
# 2. The query involves {{JOIN}}ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may
occur with internal tables as well, but I am unable to test this in the Hive
instance I have access to. The storage types I have tested this with are
Parquet and Avro. If all tables use Parquet or all use Avro, the error is
raised. If there is a mixture of storage types, the query runs successfully.
I tried to dig through code to find and fix the issue myself, but unfortunately
I did not have a complete enough understanding of Hive and its dependencies to
pinpoint the issue.
Further examples and explanation - given the following two table definitions:
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
int_col BIGINT,
array_col ARRAY <BIGINT>,
record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null",
"long"]},
{"name": "array_col",
"type": ["null", {"items": ["null", "long"], "type": "array"}]},
{"name": "record_col",
"type": ["null",
{"fields": [{"name": "a", "type": ["null", "long"]},
{"name": "b", "type": ["null", "long"]}],
"name": "record_col_0",
"type": "record"}]}],
"name": "Root",
"type": "record"}'
)
{code}
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
int_col2 BIGINT,
str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null",
"long"]},
{"name": "str_col", "type": ["null", "string"]}],
"name": "Root",
"type": "record"}'
)
{code}
I can successfully query both of these tables, running the following queries:
{code:sql}
SELECT * FROM experimental.join_test;
1 [1,2] {"a":1,"b":2}
2 [3,4] {"a":3,"b":4}
3 [5,6] {"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;
1 should-appear-after-joining
2 should-appear-after-joining
3 should-appear-after-joining
4 shouldnt-appear-after-joining
{code}
I can also join the tables together, so long as I do not select the complex
type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
3 3 should-appear-after-joining
2 2 should-appear-after-joining
1 1 should-appear-after-joining
{code}
But as soon as complex columns are introduced to the {{SELECT}} clause, an
error arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The
lowest-level error message given is slightly different depending on which
complex type is involved in raising the error.
* If it is an {{ARRAY}} type, the error message is
{{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector
cannot be cast to
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
* * If it is a {{RECORD}} type, the error message is
{{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector
cannot be cast to
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
If the tables are of different types - for example, if the table {{join_test}}
was stored as Parquet instead of Avro - the query executes without issue.
>From what I can tell by looking at the stack trace, Hive is attempting to run
>this query as a vectorized query, leading to the error.Some further points of
>support for this idea:
# 1. Vectorization is not supported in queries that involved complex type
columns, which lines up with the fact that queries only fail when complex type
columns are included in the query.
# 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}}
allows these queries to execute and finish normally.
was:
A scenario has been discovered that seems to result in attempts to vectorize
and run non-vectorizable queries. The conditions of this scenario are as
follows:A scenario has been discovered that seems to result in attempts to
vectorize and run non-vectorizable queries. The conditions of this scenario are
as follows:
# 1. The query `SELECT`s non-primitive type columns. I have tested this with
`ARRAY` and `RECORD` types.
# 2. The query involves `JOIN`ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may
occur with internal tables as well, but I am unable to test this in the Hive
instance I have access to. The storage types I have tested this with are
Parquet and Avro. If all tables use Parquet or all use Avro, the error is
raised. If there is a mixture of storage types, the query runs successfully.
Given the following two table definitions:
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
int_col BIGINT,
array_col ARRAY <BIGINT>,
record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null",
"long"]},
{"name": "array_col",
"type": ["null", {"items": ["null", "long"], "type": "array"}]},
{"name": "record_col",
"type": ["null",
{"fields": [{"name": "a", "type": ["null", "long"]},
{"name": "b", "type": ["null", "long"]}],
"name": "record_col_0",
"type": "record"}]}],
"name": "Root",
"type": "record"}'
)
{code}
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
int_col2 BIGINT,
str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null",
"long"]},
{"name": "str_col", "type": ["null", "string"]}],
"name": "Root",
"type": "record"}'
)
{code}
I can successfully query both of these tables, running the following queries:
{code:sql}
SELECT * FROM experimental.join_test;
1 [1,2] {"a":1,"b":2}
2 [3,4] {"a":3,"b":4}
3 [5,6] {"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;
1 should-appear-after-joining
2 should-appear-after-joining
3 should-appear-after-joining
4 shouldnt-appear-after-joining
{code}
I can also join the tables together, so long as I do not select the complex
type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
3 3 should-appear-after-joining
2 2 should-appear-after-joining
1 1 should-appear-after-joining
{code}
But as soon as complex columns are introduced to the `SELECT` clause, an error
arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The
lowest-level error message given is slightly different depending on which
complex type is involved in raising the error.* If it is an `ARRAY` type, the
error message is
`org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector
cannot be cast to
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`* If it
is a `RECORD` type, the error message is
`org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector
cannot be cast to
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`
If the tables are of different types - for example, if the table `join_test`
was stored as Parquet instead of Avro - the query executes without issue.
>From what I can tell by looking at the stack trace, Hive is attempting to run
>this query as a vectorized query, leading to the error.Some further points of
>support for this idea:1. Vectorization is not supported in queries that
>involved complex type columns, which lines up with the fact that queries only
>fail when complex type columns are included in the query.2. Explicitly setting
>`hive.vectorized.execution.enabled` to `'false'` allows these queries to
>execute and finish normally.
> Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures
> -----------------------------------------------------------------------------
>
> Key: HIVE-24647
> URL: https://issues.apache.org/jira/browse/HIVE-24647
> Project: Hive
> Issue Type: Bug
> Components: File Formats, hpl/sql, Query Processor, Vectorization
> Affects Versions: 3.1.2
> Environment: Hadoop and Hive instances running via Amazon EMR.
> Queries run via `hive` CLI.
> OS: Amazon Linux 2
> Reporter: George Wood
> Priority: Major
> Attachments: hive_complex_join_err.txt
>
>
> A scenario has been discovered that seems to result in attempts to vectorize
> and run non-vectorizable queries. The conditions of this scenario are as
> follows:A scenario has been discovered that seems to result in attempts to
> vectorize and run non-vectorizable queries. The conditions of this scenario
> are as follows:
> # 1. The query {{SELECT}}s non-primitive type columns. I have tested this
> with {{ARRAY}} and {{RECORD}} types.
> # 2. The query involves {{JOIN}}ing of two or more tables.
> # 3. The tables are external, and they use the same storage type. This may
> occur with internal tables as well, but I am unable to test this in the Hive
> instance I have access to. The storage types I have tested this with are
> Parquet and Avro. If all tables use Parquet or all use Avro, the error is
> raised. If there is a mixture of storage types, the query runs successfully.
> I tried to dig through code to find and fix the issue myself, but
> unfortunately I did not have a complete enough understanding of Hive and its
> dependencies to pinpoint the issue.
> Further examples and explanation - given the following two table definitions:
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test (
> int_col BIGINT,
> array_col ARRAY <BIGINT>,
> record_col STRUCT <a: BIGINT, b: BIGINT>
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test/'
> TBLPROPERTIES (
> 'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null",
> "long"]},
> {"name": "array_col",
> "type": ["null", {"items": ["null", "long"], "type": "array"}]},
> {"name": "record_col",
> "type": ["null",
> {"fields": [{"name": "a", "type": ["null", "long"]},
> {"name": "b", "type": ["null", "long"]}],
> "name": "record_col_0",
> "type": "record"}]}],
> "name": "Root",
> "type": "record"}'
> )
> {code}
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test2 (
> int_col2 BIGINT,
> str_col STRING
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test2/'
> TBLPROPERTIES (
> 'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null",
> "long"]},
> {"name": "str_col", "type": ["null", "string"]}],
> "name": "Root",
> "type": "record"}'
> )
> {code}
> I can successfully query both of these tables, running the following queries:
> {code:sql}
> SELECT * FROM experimental.join_test;
> 1 [1,2] {"a":1,"b":2}
> 2 [3,4] {"a":3,"b":4}
> 3 [5,6] {"a":5,"b":6}
> {code}
> {code:sql}
> SELECT * FROM experimental.join_test2;
> 1 should-appear-after-joining
> 2 should-appear-after-joining
> 3 should-appear-after-joining
> 4 shouldnt-appear-after-joining
> {code}
> I can also join the tables together, so long as I do not select the complex
> type columns:
> {code:sql}
> SELECT int_col, int_col2, str_col
> FROM experimental.join_test AS jt
> JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> 3 3 should-appear-after-joining
> 2 2 should-appear-after-joining
> 1 1 should-appear-after-joining
> {code}
> But as soon as complex columns are introduced to the {{SELECT}} clause, an
> error arises.
> {code:sql}
> SELECT *
> FROM experimental.george_test AS jt
> JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> {code}
> I will attach the full stack trace as a file - it is rather large. The
> lowest-level error message given is slightly different depending on which
> complex type is involved in raising the error.
> * If it is an {{ARRAY}} type, the error message is
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector
> cannot be cast to
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> * * If it is a {{RECORD}} type, the error message is
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector
> cannot be cast to
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> If the tables are of different types - for example, if the table
> {{join_test}} was stored as Parquet instead of Avro - the query executes
> without issue.
> From what I can tell by looking at the stack trace, Hive is attempting to run
> this query as a vectorized query, leading to the error.Some further points of
> support for this idea:
> # 1. Vectorization is not supported in queries that involved complex type
> columns, which lines up with the fact that queries only fail when complex
> type columns are included in the query.
> # 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}}
> allows these queries to execute and finish normally.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)