George Wood created HIVE-24647: ---------------------------------- Summary: Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures Key: HIVE-24647 URL: https://issues.apache.org/jira/browse/HIVE-24647 Project: Hive Issue Type: Bug Components: File Formats, hpl/sql, Query Processor, Vectorization Affects Versions: 3.1.2 Environment: Hadoop and Hive instances running via Amazon EMR.
Queries run via `hive` CLI. OS: Amazon Linux 2 Reporter: George Wood Attachments: hive_complex_join_err.txt A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows:A scenario has been discovered that seems to result in attempts to vectorize and run non-vectorizable queries. The conditions of this scenario are as follows: # 1. The query `SELECT`s non-primitive type columns. I have tested this with `ARRAY` and `RECORD` types. # 2. The query involves `JOIN`ing of two or more tables. # 3. The tables are external, and they use the same storage type. This may occur with internal tables as well, but I am unable to test this in the Hive instance I have access to. The storage types I have tested this with are Parquet and Avro. If all tables use Parquet or all use Avro, the error is raised. If there is a mixture of storage types, the query runs successfully. Given the following two table definitions: {code:sql} CREATE EXTERNAL TABLE experimental.join_test ( int_col BIGINT, array_col ARRAY <BIGINT>, record_col STRUCT <a: BIGINT, b: BIGINT> ) STORED AS AVRO LOCATION 's3://s3-bucket/join_test/' TBLPROPERTIES ( 'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", "long"]}, {"name": "array_col", "type": ["null", {"items": ["null", "long"], "type": "array"}]}, {"name": "record_col", "type": ["null", {"fields": [{"name": "a", "type": ["null", "long"]}, {"name": "b", "type": ["null", "long"]}], "name": "record_col_0", "type": "record"}]}], "name": "Root", "type": "record"}' ) {code} {code:sql} CREATE EXTERNAL TABLE experimental.join_test2 ( int_col2 BIGINT, str_col STRING ) STORED AS AVRO LOCATION 's3://s3-bucket/join_test2/' TBLPROPERTIES ( 'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", "long"]}, {"name": "str_col", "type": ["null", "string"]}], "name": "Root", "type": "record"}' ) {code} I can successfully query both of these tables, running the following queries: {code:sql} SELECT * FROM experimental.join_test; 1 [1,2] {"a":1,"b":2} 2 [3,4] {"a":3,"b":4} 3 [5,6] {"a":5,"b":6} {code} {code:sql} SELECT * FROM experimental.join_test2; 1 should-appear-after-joining 2 should-appear-after-joining 3 should-appear-after-joining 4 shouldnt-appear-after-joining {code} I can also join the tables together, so long as I do not select the complex type columns: {code:sql} SELECT int_col, int_col2, str_col FROM experimental.join_test AS jt JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2; 3 3 should-appear-after-joining 2 2 should-appear-after-joining 1 1 should-appear-after-joining {code} But as soon as complex columns are introduced to the `SELECT` clause, an error arises. {code:sql} SELECT * FROM experimental.george_test AS jt JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2; {code} I will attach the full stack trace as a file - it is rather large. The lowest-level error message given is slightly different depending on which complex type is involved in raising the error.* If it is an `ARRAY` type, the error message is `org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`* If it is a `RECORD` type, the error message is `org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector` If the tables are of different types - for example, if the table `join_test` was stored as Parquet instead of Avro - the query executes without issue. >From what I can tell by looking at the stack trace, Hive is attempting to run >this query as a vectorized query, leading to the error.Some further points of >support for this idea:1. Vectorization is not supported in queries that >involved complex type columns, which lines up with the fact that queries only >fail when complex type columns are included in the query.2. Explicitly setting >`hive.vectorized.execution.enabled` to `'false'` allows these queries to >execute and finish normally. -- This message was sent by Atlassian Jira (v8.3.4#803005)