[
https://issues.apache.org/jira/browse/IMPALA-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zoltán Borók-Nagy updated IMPALA-7471:
--------------------------------------
Target Version: Impala 3.2.0 (was: Impala 3.1.0)
> Impala crashes or returns incorrect results when querying parquet nested types
> ------------------------------------------------------------------------------
>
> Key: IMPALA-7471
> URL: https://issues.apache.org/jira/browse/IMPALA-7471
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Csaba Ringhofer
> Priority: Critical
> Labels: correctness, crash, parquet
> Attachments: test_users_131786401297925138_0.parquet
>
>
> From
> http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779
> {quote}We found a case where Impala returns incorrect values from simple
> query. Our data contains nested array of structures and structures contains
> other structures.
> We generated minimal sample data allowing to reproduce the issue.
>
> SQL to create a table:
> {quote}
> {code}
> CREATE TABLE plat_test.test_users (
> id INT,
> name STRING,
> devices ARRAY<
> STRUCT<
> id:STRING,
> device_info:STRUCT<
> model:STRING
> >
> >
> >
> )
> STORED AS PARQUET
> {code}
> {quote}
> Please put attached parquet file to the location of the table and refresh the
> table.
> In sample data we have 2 users, one with 2 devices, second one with 3. Some
> of the devices.device_info.model fields are NULL.
>
> When I issue a query:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u,
> u.devices d;
> {code}
> {quote}
> I'm expecting to get 5 records in results, but getting only one1.png
> If I change query to:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u
> LEFT OUTER JOIN u.devices d;
> {code}
> {quote}
> I'm getting two records in the results, but still not as it should be.
> We found some workaround to this problem. If we add to the result columns
> device.id we will get all records from parquet file:
> {quote}
> {code}
> SELECT u.name, d.id, d.device_info.model as model
> FROM test_users u
> , u.devices d
> {code}
> {quote}
> And result is 3.png
>
> But we can't rely on this workaround, because we don't need device.id in all
> queries and Impala optimizes it, and as a result we are getting unpredicted
> results.
>
> I tested Hive query on this table and it returns expected results:
> {quote}
> {code}
> SELECT u.name, d.device_info.model
> FROM test_users u
> lateral view outer inline (u.devices) d;
> {code}
> {quote}
> results:
> 4.png
> Please advice if it's a problem in Impala engine or we did some mistake in
> our query.
>
> Best regards,
> Come2Play team.
> {quote}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]