[ https://issues.apache.org/jira/browse/IMPALA-9469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814218#comment-17814218 ]
ASF subversion and git services commented on IMPALA-9469:
---------------------------------------------------------

Commit 1292d18ad6d34053bd275feb54597d1b68d07840 in impala's branch refs/heads/branch-3.4.2 from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1292d18ad ]

IMPALA-11444: Fix wrong results when reading wide rows from ORC

After IMPALA-9228, the ORC scanner reads rows into a scratch batch, where we evaluate conjuncts and runtime filters. The surviving rows are then picked up by the output row batch. We loop until the output row batch is filled (1024 rows by default) or we finish reading the current ORC batch (1024 rows by default). Usually the loop has only one iteration, since the scratch batch capacity is also 1024, so all rows of the current ORC batch can be materialized into the scratch batch. However, when reading wide rows whose tuple size is larger than 4096 bytes, the scratch batch capacity is reduced below 1024, i.e. the scratch batch can store fewer than 1024 rows. In this case the loop needs more iterations. The bug is that we didn't commit rows to the output row batch after each iteration, so the surviving rows were overwritten in the second iteration. This was fixed by a later optimization (IMPALA-9469) that is missing in the 3.x branch. This patch picks only that fix.
Tests:
- Add test on wide tables with 2K columns

Change-Id: I09f1c23c817ad012587355c16f37f42d5fb41bff
Reviewed-on: http://gerrit.cloudera.org:8080/18745
Reviewed-by: Gabor Kaszab <gaborkas...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>

> ORC scanner vectorization for collection types
> ----------------------------------------------
>
>         Key: IMPALA-9469
>         URL: https://issues.apache.org/jira/browse/IMPALA-9469
>     Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>    Reporter: Gabor Kaszab
>    Assignee: Zoltán Borók-Nagy
>    Priority: Major
>      Labels: complextype
>
> https://issues.apache.org/jira/browse/IMPALA-9228 introduced vectorization for primitive types and structs. This Jira covers the same for collections (array, map) and for structs containing collections.
>
> *Prerequisites:*
> 1) Check how IMPALA-9228 introduces scratch batches to hold a batch of rows, and how they are populated from primitive and struct fields.
> 2) Read the following document to understand the difference between materialising and non-materialising collection readers: https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A
> 3) Check how the Parquet scanner handles collections when populating its scratch batch.
>
> Implementation details:
> 1) Materialising collection readers should be handled similarly to primitive types. In this case each collection reader writes one slot into the outgoing RowBatch per collection it reads; in other words, one collection is represented as one CollectionValue in the RowBatch.
> 2) The other case is when the top-level collection reader doesn't materialise directly into the RowBatch but instead delegates the materialisation to its children. In this case it is not guaranteed that the number of slots required in the RowBatch equals the number of collections in the collection reader.
> E.g. assume a table with one column: a list of integers. If the top-level ListColumnReader is not materialising, then its child, the IntColumnReader, will be. The number of required slots is then the number of int values within the collections, instead of the number of collections as it would be if the ListColumnReader materialised directly.
> As a result, while the scratch batch is being populated we might reach a situation where a whole collection doesn't fit into the scratch batch. Check how Parquet handles this case.
> 3) Once populating the scratch batch works for collections, it has to be verified that codegen also runs in these cases. It should work out of the box, but let's make sure.
> 4) Currently the ORC scanner chooses between row-by-row processing of the rows read by the ORC reader and scratch-batch reading. Once this Jira is implemented, the row-by-row approach is no longer needed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)