[ 
https://issues.apache.org/jira/browse/IMPALA-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Armstrong updated IMPALA-7471:
----------------------------------
    Description: 
>From 
>http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779


{quote}We found a case where Impala returns incorrect values from simple query. 
Our data contains nested array of structures and structures contains other 
structures.
We generated minimal sample data allowing to reproduce the issue.
 
SQL to create a table:
{quote}
{code}
CREATE TABLE plat_test.test_users (
  id INT,
  name STRING,   
  devices ARRAY<
    STRUCT<
      id:STRING,
      device_info:STRUCT<
        model:STRING
      >
    >
  >
)
STORED AS PARQUET
{code}
{quote}
Please put attached parquet file to the location of the table and refresh the 
table.
In sample data we have 2 users, one with 2 devices, second one with 3. Some of 
the devices.device_info.model fields are NULL.
 
When I issue a query:
{quote}
{code}
SELECT u.name, d.device_info.model as model
FROM test_users u,
u.devices d;
{code}
 {quote}

I'm expecting to get 5 records in results, but getting only one1.png
If I change query to:
 {quote}

{code}
SELECT u.name, d.device_info.model as model
FROM test_users u
LEFT OUTER JOIN u.devices d;
 {code}
{quote}
I'm getting two records in the results, but still not as it should be.

We found some workaround to this problem. If we add to the result columns 
device.id we will get all records from parquet file:
{quote}
{code}
SELECT u.name, d.id, d.device_info.model as model
FROM test_users u
, u.devices d
 {code}
{quote}
And result is 3.png
 
But we can't rely on this workaround, because we don't need device.id in all 
queries and Impala optimizes it, and as a result we are getting unpredicted 
results.
 
I tested Hive query on this table and it returns expected results:
{quote}
{code}
SELECT u.name, d.device_info.model
FROM test_users u
lateral view outer inline (u.devices) d;
 {code}
{quote}
results:
4.png
Please advice if it's a problem in Impala engine or we did some mistake in our 
query.
 
Best regards,
Come2Play team.
{quote}

  was:
>From 
>http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779





> Impala crashes or returns incorrect results when querying parquet nested types
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-7471
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7471
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Critical
>              Labels: correctness, crash, parquet
>         Attachments: test_users_131786401297925138_0.parquet
>
>
> From 
> http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779
> {quote}We found a case where Impala returns incorrect values from simple 
> query. Our data contains nested array of structures and structures contains 
> other structures.
> We generated minimal sample data allowing to reproduce the issue.
>  
> SQL to create a table:
> {quote}
> {code}
> CREATE TABLE plat_test.test_users (
>   id INT,
>   name STRING,   
>   devices ARRAY<
>     STRUCT<
>       id:STRING,
>       device_info:STRUCT<
>         model:STRING
>       >
>     >
>   >
> )
> STORED AS PARQUET
> {code}
> {quote}
> Please put attached parquet file to the location of the table and refresh the 
> table.
> In sample data we have 2 users, one with 2 devices, second one with 3. Some 
> of the devices.device_info.model fields are NULL.
>  
> When I issue a query:
> {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u,
> u.devices d;
> {code}
>  {quote}
> I'm expecting to get 5 records in results, but getting only one1.png
> If I change query to:
>  {quote}
> {code}
> SELECT u.name, d.device_info.model as model
> FROM test_users u
> LEFT OUTER JOIN u.devices d;
>  {code}
> {quote}
> I'm getting two records in the results, but still not as it should be.
> We found some workaround to this problem. If we add to the result columns 
> device.id we will get all records from parquet file:
> {quote}
> {code}
> SELECT u.name, d.id, d.device_info.model as model
> FROM test_users u
> , u.devices d
>  {code}
> {quote}
> And result is 3.png
>  
> But we can't rely on this workaround, because we don't need device.id in all 
> queries and Impala optimizes it, and as a result we are getting unpredicted 
> results.
>  
> I tested Hive query on this table and it returns expected results:
> {quote}
> {code}
> SELECT u.name, d.device_info.model
> FROM test_users u
> lateral view outer inline (u.devices) d;
>  {code}
> {quote}
> results:
> 4.png
> Please advice if it's a problem in Impala engine or we did some mistake in 
> our query.
>  
> Best regards,
> Come2Play team.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to