[
https://issues.apache.org/jira/browse/HIVE-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
George Wood updated HIVE-24647:
-------------------------------
Environment:
Hadoop and Hive instances running via Amazon EMR.
Hadoop version {{3.2.1}}
Hive version {{3.1.2}
Queries run via {{hive}} CLI.
OS: Amazon Linux 2
was:
Hadoop and Hive instances running via Amazon EMR.
Queries run via {{hive}} CLI.
OS: Amazon Linux 2
> Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures
> -----------------------------------------------------------------------------
>
> Key: HIVE-24647
> URL: https://issues.apache.org/jira/browse/HIVE-24647
> Project: Hive
> Issue Type: Bug
> Components: File Formats, hpl/sql, Query Processor, Vectorization
> Affects Versions: 3.1.2
> Environment: Hadoop and Hive instances running via Amazon EMR.
> Hadoop version {{3.2.1}}
> Hive version {{3.1.2}
> Queries run via {{hive}} CLI.
> OS: Amazon Linux 2
> Reporter: George Wood
> Priority: Major
> Attachments: hive_complex_join_err.txt
>
>
> A scenario has been discovered that seems to result in attempts to vectorize
> and run non-vectorizable queries. The conditions of this scenario are as
> follows:A scenario has been discovered that seems to result in attempts to
> vectorize and run non-vectorizable queries. The conditions of this scenario
> are as follows:
> # 1. The query {{SELECT}}s non-primitive type columns. I have tested this
> with {{ARRAY}} and {{RECORD}} types.
> # 2. The query involves {{JOIN}}ing of two or more tables.
> # 3. The tables are external, and they use the same storage type. This may
> occur with internal tables as well, but I am unable to test this in the Hive
> instance I have access to. The storage types I have tested this with are
> Parquet and Avro. If all tables use Parquet or all use Avro, the error is
> raised. If there is a mixture of storage types, the query runs successfully.
> I tried to dig through code to find and fix the issue myself, but
> unfortunately I did not have a complete enough understanding of Hive and its
> dependencies to pinpoint the issue.
> Further examples and explanation - given the following two table definitions:
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test (
> int_col BIGINT,
> array_col ARRAY <BIGINT>,
> record_col STRUCT <a: BIGINT, b: BIGINT>
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test/'
> TBLPROPERTIES (
> 'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null",
> "long"]},
> {"name": "array_col",
> "type": ["null", {"items": ["null", "long"], "type": "array"}]},
> {"name": "record_col",
> "type": ["null",
> {"fields": [{"name": "a", "type": ["null", "long"]},
> {"name": "b", "type": ["null", "long"]}],
> "name": "record_col_0",
> "type": "record"}]}],
> "name": "Root",
> "type": "record"}'
> )
> {code}
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test2 (
> int_col2 BIGINT,
> str_col STRING
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test2/'
> TBLPROPERTIES (
> 'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null",
> "long"]},
> {"name": "str_col", "type": ["null", "string"]}],
> "name": "Root",
> "type": "record"}'
> )
> {code}
> I can successfully query both of these tables, running the following queries:
> {code:sql}
> SELECT * FROM experimental.join_test;
> 1 [1,2] {"a":1,"b":2}
> 2 [3,4] {"a":3,"b":4}
> 3 [5,6] {"a":5,"b":6}
> {code}
> {code:sql}
> SELECT * FROM experimental.join_test2;
> 1 should-appear-after-joining
> 2 should-appear-after-joining
> 3 should-appear-after-joining
> 4 shouldnt-appear-after-joining
> {code}
> I can also join the tables together, so long as I do not select the complex
> type columns:
> {code:sql}
> SELECT int_col, int_col2, str_col
> FROM experimental.join_test AS jt
> JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> 3 3 should-appear-after-joining
> 2 2 should-appear-after-joining
> 1 1 should-appear-after-joining
> {code}
> But as soon as complex columns are introduced to the {{SELECT}} clause, an
> error arises.
> {code:sql}
> SELECT *
> FROM experimental.george_test AS jt
> JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> {code}
> I will attach the full stack trace as a file - it is rather large. The
> lowest-level error message given is slightly different depending on which
> complex type is involved in raising the error.
> * If it is an {{ARRAY}} type, the error message is
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector
> cannot be cast to
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> * * If it is a {{RECORD}} type, the error message is
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector
> cannot be cast to
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> If the tables are of different types - for example, if the table
> {{join_test}} was stored as Parquet instead of Avro - the query executes
> without issue.
> From what I can tell by looking at the stack trace, Hive is attempting to run
> this query as a vectorized query, leading to the error.Some further points of
> support for this idea:
> # 1. Vectorization is not supported in queries that involved complex type
> columns, which lines up with the fact that queries only fail when complex
> type columns are included in the query.
> # 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}}
> allows these queries to execute and finish normally.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)