[ 
https://issues.apache.org/jira/browse/HIVE-24647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Wood updated HIVE-24647:
-------------------------------
    Description: 
A scenario has been discovered that seems to result in attempts to vectorize 
and run non-vectorizable queries. The conditions of this scenario are as 
follows:A scenario has been discovered that seems to result in attempts to 
vectorize and run non-vectorizable queries. The conditions of this scenario are 
as follows:
# 1. The query {{SELECT}}s non-primitive type columns. I have tested this with 
{{ARRAY}} and {{RECORD}} types. 
# 2. The query involves {{JOIN}}ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may 
occur with internal tables as well, but I am unable to test this in the Hive 
instance I have access to. The storage types I have tested this with are 
Parquet and Avro. If all tables use Parquet or all use Avro, the error is 
raised. If there is a mixture of storage types, the query runs successfully.

I tried to dig through code to find and fix the issue myself, but unfortunately 
I did not have a complete enough understanding of Hive and its dependencies to 
pinpoint the issue.

Further examples and explanation - given the following two table definitions:
{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
     int_col BIGINT,
     array_col ARRAY <BIGINT>,
     record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", 
"long"]},
            {"name": "array_col",
             "type": ["null", {"items": ["null", "long"], "type": "array"}]},
            {"name": "record_col",
             "type": ["null",
                      {"fields": [{"name": "a", "type": ["null", "long"]},
                                  {"name": "b", "type": ["null", "long"]}],
                       "name": "record_col_0",
                       "type": "record"}]}],
 "name": "Root",
 "type": "record"}'
)
{code}

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
     int_col2 BIGINT,
     str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", 
"long"]},
            {"name": "str_col", "type": ["null", "string"]}],
 "name": "Root",
 "type": "record"}'
)
{code}

I can successfully query both of these tables, running the following queries:

{code:sql}
SELECT * FROM experimental.join_test;

1       [1,2]   {"a":1,"b":2}
2       [3,4]   {"a":3,"b":4}
3       [5,6]   {"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;

1       should-appear-after-joining
2       should-appear-after-joining
3       should-appear-after-joining
4       shouldnt-appear-after-joining
{code}

I can also join the tables together, so long as I do not select the complex 
type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;

3       3       should-appear-after-joining
2       2       should-appear-after-joining
1       1       should-appear-after-joining
{code}

But as soon as complex columns are introduced to the {{SELECT}} clause, an 
error arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The 
lowest-level error message given is slightly different depending on which 
complex type is involved in raising the error.
* If it is an {{ARRAY}} type, the error message is 
{{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
* * If it is a {{RECORD}} type, the error message is 
{{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
If the tables are of different types - for example, if the table {{join_test}} 
was stored as Parquet instead of Avro - the query executes without issue.
>From what I can tell by looking at the stack trace, Hive is attempting to run 
>this query as a vectorized query, leading to the error.Some further points of 
>support for this idea:
# 1. Vectorization is not supported in queries that involved complex type 
columns, which lines up with the fact that queries only fail when complex type 
columns are included in the query.
# 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}} 
allows these queries to execute and finish normally.

  was:
A scenario has been discovered that seems to result in attempts to vectorize 
and run non-vectorizable queries. The conditions of this scenario are as 
follows:A scenario has been discovered that seems to result in attempts to 
vectorize and run non-vectorizable queries. The conditions of this scenario are 
as follows:
# 1. The query `SELECT`s non-primitive type columns. I have tested this with 
`ARRAY` and `RECORD` types. 
# 2. The query involves `JOIN`ing of two or more tables.
# 3. The tables are external, and they use the same storage type. This may 
occur with internal tables as well, but I am unable to test this in the Hive 
instance I have access to. The storage types I have tested this with are 
Parquet and Avro. If all tables use Parquet or all use Avro, the error is 
raised. If there is a mixture of storage types, the query runs successfully.

Given the following two table definitions:

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test (
     int_col BIGINT,
     array_col ARRAY <BIGINT>,
     record_col STRUCT <a: BIGINT, b: BIGINT>
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", 
"long"]},
            {"name": "array_col",
             "type": ["null", {"items": ["null", "long"], "type": "array"}]},
            {"name": "record_col",
             "type": ["null",
                      {"fields": [{"name": "a", "type": ["null", "long"]},
                                  {"name": "b", "type": ["null", "long"]}],
                       "name": "record_col_0",
                       "type": "record"}]}],
 "name": "Root",
 "type": "record"}'
)
{code}

{code:sql}
CREATE EXTERNAL TABLE experimental.join_test2 (
     int_col2 BIGINT,
     str_col STRING
)
STORED AS AVRO
LOCATION 's3://s3-bucket/join_test2/'
TBLPROPERTIES (
  'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", 
"long"]},
            {"name": "str_col", "type": ["null", "string"]}],
 "name": "Root",
 "type": "record"}'
)
{code}

I can successfully query both of these tables, running the following queries:

{code:sql}
SELECT * FROM experimental.join_test;

1       [1,2]   {"a":1,"b":2}
2       [3,4]   {"a":3,"b":4}
3       [5,6]   {"a":5,"b":6}
{code}
{code:sql}
SELECT * FROM experimental.join_test2;

1       should-appear-after-joining
2       should-appear-after-joining
3       should-appear-after-joining
4       shouldnt-appear-after-joining
{code}

I can also join the tables together, so long as I do not select the complex 
type columns:
{code:sql}
SELECT int_col, int_col2, str_col
FROM experimental.join_test AS jt
JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;

3       3       should-appear-after-joining
2       2       should-appear-after-joining
1       1       should-appear-after-joining
{code}

But as soon as complex columns are introduced to the `SELECT` clause, an error 
arises.
{code:sql}
SELECT *
FROM experimental.george_test AS jt
JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
{code}
I will attach the full stack trace as a file - it is rather large. The 
lowest-level error message given is slightly different depending on which 
complex type is involved in raising the error.* If it is an `ARRAY` type, the 
error message is 
`org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`* If it 
is a `RECORD` type, the error message is 
`org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector 
cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector`
If the tables are of different types - for example, if the table `join_test` 
was stored as Parquet instead of Avro - the query executes without issue.
>From what I can tell by looking at the stack trace, Hive is attempting to run 
>this query as a vectorized query, leading to the error.Some further points of 
>support for this idea:1. Vectorization is not supported in queries that 
>involved complex type columns, which lines up with the fact that queries only 
>fail when complex type columns are included in the query.2. Explicitly setting 
>`hive.vectorized.execution.enabled` to `'false'` allows these queries to 
>execute and finish normally.


> Some Non-Vectorizable Queries are Run as Vectorized - Leads to Query Failures
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-24647
>                 URL: https://issues.apache.org/jira/browse/HIVE-24647
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats, hpl/sql, Query Processor, Vectorization
>    Affects Versions: 3.1.2
>         Environment: Hadoop and Hive instances running via Amazon EMR.
> Queries run via `hive` CLI.
> OS: Amazon Linux 2
>            Reporter: George Wood
>            Priority: Major
>         Attachments: hive_complex_join_err.txt
>
>
> A scenario has been discovered that seems to result in attempts to vectorize 
> and run non-vectorizable queries. The conditions of this scenario are as 
> follows:A scenario has been discovered that seems to result in attempts to 
> vectorize and run non-vectorizable queries. The conditions of this scenario 
> are as follows:
> # 1. The query {{SELECT}}s non-primitive type columns. I have tested this 
> with {{ARRAY}} and {{RECORD}} types. 
> # 2. The query involves {{JOIN}}ing of two or more tables.
> # 3. The tables are external, and they use the same storage type. This may 
> occur with internal tables as well, but I am unable to test this in the Hive 
> instance I have access to. The storage types I have tested this with are 
> Parquet and Avro. If all tables use Parquet or all use Avro, the error is 
> raised. If there is a mixture of storage types, the query runs successfully.
> I tried to dig through code to find and fix the issue myself, but 
> unfortunately I did not have a complete enough understanding of Hive and its 
> dependencies to pinpoint the issue.
> Further examples and explanation - given the following two table definitions:
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test (
>      int_col BIGINT,
>      array_col ARRAY <BIGINT>,
>      record_col STRUCT <a: BIGINT, b: BIGINT>
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col", "type": ["null", 
> "long"]},
>             {"name": "array_col",
>              "type": ["null", {"items": ["null", "long"], "type": "array"}]},
>             {"name": "record_col",
>              "type": ["null",
>                       {"fields": [{"name": "a", "type": ["null", "long"]},
>                                   {"name": "b", "type": ["null", "long"]}],
>                        "name": "record_col_0",
>                        "type": "record"}]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> {code:sql}
> CREATE EXTERNAL TABLE experimental.join_test2 (
>      int_col2 BIGINT,
>      str_col STRING
> )
> STORED AS AVRO
> LOCATION 's3://s3-bucket/join_test2/'
> TBLPROPERTIES (
>   'avro.schema.literal'='{"fields": [{"name": "int_col2", "type": ["null", 
> "long"]},
>             {"name": "str_col", "type": ["null", "string"]}],
>  "name": "Root",
>  "type": "record"}'
> )
> {code}
> I can successfully query both of these tables, running the following queries:
> {code:sql}
> SELECT * FROM experimental.join_test;
> 1     [1,2]   {"a":1,"b":2}
> 2     [3,4]   {"a":3,"b":4}
> 3     [5,6]   {"a":5,"b":6}
> {code}
> {code:sql}
> SELECT * FROM experimental.join_test2;
> 1     should-appear-after-joining
> 2     should-appear-after-joining
> 3     should-appear-after-joining
> 4     shouldnt-appear-after-joining
> {code}
> I can also join the tables together, so long as I do not select the complex 
> type columns:
> {code:sql}
> SELECT int_col, int_col2, str_col
> FROM experimental.join_test AS jt
> JOIN experimental.join_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> 3     3       should-appear-after-joining
> 2     2       should-appear-after-joining
> 1     1       should-appear-after-joining
> {code}
> But as soon as complex columns are introduced to the {{SELECT}} clause, an 
> error arises.
> {code:sql}
> SELECT *
> FROM experimental.george_test AS jt
> JOIN experimental.george_test2 AS jt2 ON jt.int_col = jt2.int_col2;
> {code}
> I will attach the full stack trace as a file - it is rather large. The 
> lowest-level error message given is slightly different depending on which 
> complex type is involved in raising the error.
> * If it is an {{ARRAY}} type, the error message is 
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector 
> cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> * * If it is a {{RECORD}} type, the error message is 
> {{org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector 
> cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector}}
> If the tables are of different types - for example, if the table 
> {{join_test}} was stored as Parquet instead of Avro - the query executes 
> without issue.
> From what I can tell by looking at the stack trace, Hive is attempting to run 
> this query as a vectorized query, leading to the error.Some further points of 
> support for this idea:
> # 1. Vectorization is not supported in queries that involved complex type 
> columns, which lines up with the fact that queries only fail when complex 
> type columns are included in the query.
> # 2. Explicitly setting {{hive.vectorized.execution.enabled}} to {{'false'}} 
> allows these queries to execute and finish normally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to