[ 
https://issues.apache.org/jira/browse/HIVE-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Wood updated HIVE-24647:
-------------------------------
    Environment: 
Hadoop and Hive instances running via Amazon EMR.

Queries run via {{hive}} CLI.

OS: Amazon Linux 2

  was:
Hadoop and Hive instances running via Amazon EMR.

Queries run via `hive` CLI.

OS: Amazon Linux 2


> Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-24647
>                 URL: https://issues.apache.org/jira/browse/HIVE-24647
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, hpl/sql, Query Processor, Vectorization
>    Affects Versions: 3.1.2
>         Environment: Hadoop and Hive instances running via Amazon EMR.
> Queries run via {{hive}} CLI.
> OS: Amazon Linux 2
>            Reporter: George Wood
>            Priority: Major
>         Attachments: hive_complex_join_err.txt
>
>
> A scenario has been discovered that seems to result in attempts to vectorize 
> and run non-vectorizable queries. The conditions of this scenario are as 
> follows:A scenario has been discovered that seems to result in attempts to 
> vectorize and run non-vectorizable queries. The conditions of this scenario 
> are as follows:
> # 1. The query {{SELECT}}s non-primitive type columns. I have tested this 
> with {{ARRAY}} and {{RECORD}} types. 
> # 2. The query involves {{JOIN}}ing of two or more tables.
> # 3. The tables are external, and they use the same storage type. This may 
> occur with internal tables as well, but I am unable to test this in the Hive 
> instance I have access to. The storage types I have tested this with are 
> Parquet and Avro. If all tables use Parquet or all use Avro, the error is 
> raised. If there is a mixture of storage types, the query runs successfully.
> I tried to dig through code to find and fix the issue myself, but 
> unfortunately I did not have a complete enough understanding of Hive and its 
> dependencies to pinpoint the issue.
> Further examples and explanation - given the following two table definitions:
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test (
>      int_col BIGINT,
>      array_col ARRAY <BIGINT>,
>      record_col STRUCT <a: BIGINT, b: BIGINT>
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", 
> "long"]},
>             {"name": "array_col",
>              "type": ["null", {"items": ["null", "long"], "type": "array"}]},
>             {"name": "record_col",
>              "type": ["null",
>                       {"fields": [{"name": "a", "type": ["null", "long"]},
>                                   {"name": "b", "type": ["null", "long"]}],
>                        "name": "record_col_0",
>                        "type": "record"}]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test2 (
>      int_col2 BIGINT,
>      str_col STRING
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test2/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", 
> "long"]},
>             {"name": "str_col", "type": ["null", "string"]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> I can successfully query both of these tables, running the following queries:
> {code:sql}
> SELECT * FROM experimental.join_test;
> 1     [1,2]   {"a":1,"b":2}
> 2     [3,4]   {"a":3,"b":4}
> 3     [5,6]   {"a":5,"b":6}
> {code}
> {code:sql}
> SELECT * FROM experimental.join_test2;
> 1     should-appear-after-joining
> 2     should-appear-after-joining
> 3     should-appear-after-joining
> 4     shouldnt-appear-after-joining
> {code}
> I can also join the tables together, so long as I do not select the complex 
> type columns:
> {code:sql}
> SELECT int_col, int_col2, str_col
> FROM experimental.join_test AS jt
> JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> 3     3       should-appear-after-joining
> 2     2       should-appear-after-joining
> 1     1       should-appear-after-joining
> {code}
> But as soon as complex columns are introduced to the {{SELECT}} clause, an 
> error arises.
> {code:sql}
> SELECT *
> FROM experimental.george_test AS jt
> JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> {code}
> I will attach the full stack trace as a file - it is rather large. The 
> lowest-level error message given is slightly different depending on which 
> complex type is involved in raising the error.
> * If it is an {{ARRAY}} type, the error message is 
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector 
> cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> * * If it is a {{RECORD}} type, the error message is 
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector 
> cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> If the tables are of different types - for example, if the table 
> {{join_test}} was stored as Parquet instead of Avro - the query executes 
> without issue.
> From what I can tell by looking at the stack trace, Hive is attempting to run 
> this query as a vectorized query, leading to the error.Some further points of 
> support for this idea:
> # 1. Vectorization is not supported in queries that involved complex type 
> columns, which lines up with the fact that queries only fail when complex 
> type columns are included in the query.
> # 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}} 
> allows these queries to execute and finish normally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to