[jira] [Created] (HIVE-24647) Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures

George Wood (Jira) Fri, 15 Jan 2021 16:43:24 -0800

George Wood created HIVE-24647:
----------------------------------

             Summary: Some Non-Vectorizable Queries are Run as Vectorized - 
Leads to Query Failures
                 Key: HIVE-24647
                 URL: https://issues.apache.org/jira/browse/HIVE-24647
             Project: Hive
          Issue Type: Bug
          Components: File Formats, hpl/sql, Query Processor, Vectorization
    Affects Versions: 3.1.2
         Environment: Hadoop and Hive instances running via Amazon EMR.


Queries run via `hive` CLI.

OS: Amazon Linux 2
            Reporter: George Wood
         Attachments: hive_complex_join_err.txt

A scenario has been discovered that seems to result in attempts to vectorize 
and run non-vectorizable queries. The conditions of this scenario are as 
follows:A scenario has been discovered that seems to result in attempts to 
vectorize and run non-vectorizable queries. The conditions of this scenario are 
as follows:
# 1. The query `SELECT`s non-primitive type columns. I have tested this with 
`ARRAY` and `RECORD` types. 
# 2. The query involves `JOIN`ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may 
occur with internal tables as well, but I am unable to test this in the Hive 
instance I have access to. The storage types I have tested this with are 
Parquet and Avro. If all tables use Parquet or all use Avro, the error is 
raised. If there is a mixture of storage types, the query runs successfully.

Given the following two table definitions:

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
     int_col BIGINT,
     array_col ARRAY <BIGINT>,
     record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", 
"long"]},
            {"name": "array_col",
             "type": ["null", {"items": ["null", "long"], "type": "array"}]},
            {"name": "record_col",
             "type": ["null",
                      {"fields": [{"name": "a", "type": ["null", "long"]},
                                  {"name": "b", "type": ["null", "long"]}],
                       "name": "record_col_0",
                       "type": "record"}]}],
 "name": "Root",
 "type": "record"}'
)
{code}

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
     int_col2 BIGINT,
     str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", 
"long"]},
            {"name": "str_col", "type": ["null", "string"]}],
 "name": "Root",
 "type": "record"}'
)
{code}

I can successfully query both of these tables, running the following queries:

{code:sql}
SELECT * FROM experimental.join_test;

1       [1,2]   {"a":1,"b":2}
2       [3,4]   {"a":3,"b":4}
3       [5,6]   {"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;

1       should-appear-after-joining
2       should-appear-after-joining
3       should-appear-after-joining
4       shouldnt-appear-after-joining
{code}

I can also join the tables together, so long as I do not select the complex 
type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;

3       3       should-appear-after-joining
2       2       should-appear-after-joining
1       1       should-appear-after-joining
{code}

But as soon as complex columns are introduced to the `SELECT` clause, an error 
arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The 
lowest-level error message given is slightly different depending on which 
complex type is involved in raising the error.* If it is an `ARRAY` type, the 
error message is 
`org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`* If it 
is a `RECORD` type, the error message is 
`org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`
If the tables are of different types - for example, if the table `join_test` 
was stored as Parquet instead of Avro - the query executes without issue.
>From what I can tell by looking at the stack trace, Hive is attempting to run 
>this query as a vectorized query, leading to the error.Some further points of 
>support for this idea:1. Vectorization is not supported in queries that 
>involved complex type columns, which lines up with the fact that queries only 
>fail when complex type columns are included in the query.2. Explicitly setting 
>`hive.vectorized.execution.enabled` to `'false'` allows these queries to 
>execute and finish normally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (HIVE-24647) Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures

Reply via email to