[jira] [Commented] (IMPALA-11444) Wrong results in reading wide rows from ORC

ASF subversion and git services (Jira) Sun, 04 Feb 2024 23:53:39 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-11444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814216#comment-17814216
 ]


ASF subversion and git services commented on IMPALA-11444:
----------------------------------------------------------

Commit 1292d18ad6d34053bd275feb54597d1b68d07840 in impala's branch 
refs/heads/branch-3.4.2 from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1292d18ad ]

IMPALA-11444: Fix wrong results when reading wide rows from ORC

After IMPALA-9228, the ORC scanner reads rows into a scratch batch, where
it evaluates conjuncts and runtime filters. The surviving rows are picked
up by the output row batch. This loops until the output row batch is
filled (1024 rows by default) or the current ORC batch (1024 rows by
default) has been fully read.

Usually the loop has only one iteration, since the scratch batch
capacity is also 1024 and all rows of the current ORC batch can be
materialized into the scratch batch. However, when reading wide rows
whose tuple size is larger than 4096 bytes, the scratch batch capacity
is reduced below 1024, i.e. the scratch batch can store fewer than 1024
rows. In this case, the loop needs more iterations.
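
The relationship between tuple size and scratch batch capacity can be
sketched roughly as below. This is an illustration, not the Impala code:
the 4 MiB buffer budget is an assumption inferred from the numbers above
(4096 bytes x 1024 rows), and the names scratch_capacity and
iterations_needed are hypothetical.

```python
DEFAULT_BATCH_SIZE = 1024

# Assumption: a fixed-length tuple buffer budget of 4096 bytes per row at the
# default batch size, i.e. 4 MiB. Inferred from the commit message, not taken
# from the Impala source.
ASSUMED_BUFFER_LIMIT = 4096 * DEFAULT_BATCH_SIZE


def scratch_capacity(tuple_size_bytes: int,
                     batch_size: int = DEFAULT_BATCH_SIZE,
                     buffer_limit: int = ASSUMED_BUFFER_LIMIT) -> int:
    """Number of tuples that fit into the scratch batch (at least 1)."""
    if tuple_size_bytes <= 0:
        raise ValueError("tuple size must be positive")
    return max(1, min(batch_size, buffer_limit // tuple_size_bytes))


def iterations_needed(orc_batch_rows: int, capacity: int) -> int:
    """Loop iterations required to drain one ORC batch (ceiling division)."""
    return -(-orc_batch_rows // capacity)


if __name__ == "__main__":
    for size in (1024, 4096, 4097, 8192):
        cap = scratch_capacity(size)
        print(size, cap, iterations_needed(DEFAULT_BATCH_SIZE, cap))
```

Under this assumption, any tuple size above 4096 bytes yields a capacity
below 1024 and therefore at least two iterations per ORC batch.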

The bug is that we didn't commit rows to the output row batch after each
iteration, so the surviving rows are overwritten in the second iteration.
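
A toy model of the overwrite, under simplifying assumptions: the scratch
batch is a list of reusable slots, the output batch holds references to
those slots, and "commit" copies the values out. The function scan_batch
and its parameters are hypothetical and do not mirror the real scanner
code. The exact wrong value seen in practice depends on details this
model leaves out.

```python
from typing import Callable, List


def scan_batch(orc_rows: List[int],
               scratch_capacity: int,
               predicate: Callable[[int], bool],
               commit_each_iteration: bool) -> List[int]:
    """Filter one ORC batch through a reusable scratch batch."""
    # Each slot is a one-element list so the output can hold a reference to it.
    scratch = [[None] for _ in range(scratch_capacity)]
    pending = []    # references to scratch slots that passed the predicate
    committed = []  # values safely copied out of the scratch batch
    pos = 0
    while pos < len(orc_rows):
        n = min(scratch_capacity, len(orc_rows) - pos)
        # Materialize the next n rows, reusing (overwriting) the same slots.
        for i in range(n):
            scratch[i][0] = orc_rows[pos + i]
        pos += n
        # Evaluate the predicate on the freshly materialized rows.
        for i in range(n):
            if predicate(scratch[i][0]):
                pending.append(scratch[i])
        if commit_each_iteration:
            # The fix: commit survivors before the slots are reused.
            committed.extend(slot[0] for slot in pending)
            pending.clear()
    # Without a per-iteration commit, survivors are only read here, after
    # their slots may already have been overwritten.
    committed.extend(slot[0] for slot in pending)
    return committed


if __name__ == "__main__":
    rows = list(range(1, 9))
    print(scan_batch(rows, 3, lambda v: v == 1, commit_each_iteration=False))
    print(scan_batch(rows, 3, lambda v: v == 1, commit_each_iteration=True))
```

With eight rows and a scratch capacity of three, the uncommitted variant
returns a value from a later iteration instead of 1, while the variant
that commits after each iteration returns the matching row. When the
capacity covers the whole batch, both variants agree.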

This was fixed by a later optimization (IMPALA-9469), which is missing in
the 3.x branch. This patch picks only the fix from it.

Tests:
 - Add test on wide tables with 2K columns

Change-Id: I09f1c23c817ad012587355c16f37f42d5fb41bff
Reviewed-on: http://gerrit.cloudera.org:8080/18745
Reviewed-by: Gabor Kaszab <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Wrong results in reading wide rows from ORC
> -------------------------------------------
>
>                 Key: IMPALA-11444
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11444
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.4.0, Impala 3.4.1
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Blocker
>              Labels: correctness
>             Fix For: Impala 3.4.2
>
>         Attachments: create-table-512cols.sql, widerow_512cols.orc
>
>
> The bug only exists in the 3.4 branches, which have IMPALA-9228 but are
> missing IMPALA-9469.
> When reading from a wide table with tuple size larger than 4096 bytes (4KB), 
> the ORC scanner produces wrong results. The issue can be reproduced using the 
> attached CREATE TABLE statement and the ORC file.
> {code:sql}
> $ bin/impala-shell.sh --quiet -f create-table-512cols.sql
> $ bin/impala-shell.sh -B --quiet -q 'show table stats orc_tbl_512cols'
> -1    0       0B      NOT CACHED      NOT CACHED      ORC     false   hdfs://localhost:20500/test-warehouse/orc_tbl_512cols
> $ hdfs dfs -put widerow_512cols.orc hdfs://localhost:20500/test-warehouse/orc_tbl_512cols
> $ bin/impala-shell.sh -q 'refresh orc_tbl_512cols'
> {code}
> Then run the following query:
> {code:sql}
> $ bin/impala-shell.sh -B -q "select * from orc_tbl_512cols where col0 = '1'"
> {code}
> The result should be only one row with all values as '1'. However, we get one 
> row with all values as '1024'.
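
The expected result above could be checked mechanically with something
like the sketch below. It assumes the query's stdout was captured as a
string containing only tab-delimited result rows (the default delimiter
of impala-shell -B); the helper name check_single_row and the default
column count of 512 (taken from the table name) are assumptions.

```python
def check_single_row(output: str, expected: str = "1", num_cols: int = 512) -> bool:
    """True if output is exactly one row whose num_cols values all equal expected."""
    rows = [line.split("\t") for line in output.splitlines() if line.strip()]
    if len(rows) != 1:
        return False
    values = rows[0]
    return len(values) == num_cols and all(v == expected for v in values)


if __name__ == "__main__":
    good = "\t".join(["1"] * 512)
    bad = "\t".join(["1024"] * 512)
    print(check_single_row(good), check_single_row(bad))
```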



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
