[jira] [Created] (IMPALA-9469) ORC scanner vectorization for collection types

Gabor Kaszab (Jira) Fri, 06 Mar 2020 02:56:45 -0800

Gabor Kaszab created IMPALA-9469:
------------------------------------

             Summary: ORC scanner vectorization for collection types
                 Key: IMPALA-9469
                 URL: https://issues.apache.org/jira/browse/IMPALA-9469
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Gabor Kaszab



https://issues.apache.org/jira/browse/IMPALA-9228 introduced vectorization for 
primitive types and structs. This Jira covers the same for collections (array, 
map) and structs containing collections.

*Prerequisites:*
1) Check how IMPALA-9228 introduces scratch batches to hold a batch of rows, 
and how they are populated with primitives or struct fields.
2) Read the following document to understand the difference between 
materialising and non-materialising collection readers: 
https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A
3) Check how Parquet handles collections when populating its scratch batch.

*Implementation details:*
1) Materialising collection readers should be handled similarly to primitive 
types. In this case each collection reader writes one slot into the outgoing 
RowBatch for each collection it reads. In other words, one collection is 
represented as one CollectionValue in the RowBatch.
2) The other case is when the top-level collection reader doesn't materialise 
directly into the RowBatch but instead delegates the materialisation to its 
children. In this case it's not guaranteed that the number of required slots in 
the RowBatch equals the number of collections in the collection reader.
E.g.: Let's assume a table with one column: a list of integers. If the 
top-level ListColumnReader is not materialising, then its child, the 
IntColumnReader, will be. But the number of required slots will then be the 
number of int values within the collections instead of the number of 
collections, as it would be if the ListColumnReader was materialising directly.
As a result, while the scratch batch is being populated we might get into a 
situation where a whole collection doesn't fit into the scratch batch. Check 
how Parquet handles this case.
3) Once populating the scratch batch works for collections, it has to be 
verified that codegen also runs in these cases. It should work out of the box, 
but let's make sure.
4) Currently the ORC scanner chooses between row-by-row processing of the rows 
read by the ORC reader and scratch batch reading. Once this Jira is 
implemented, the row-by-row approach is not needed anymore.
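
To illustrate the slot-count difference from points 1) and 2), here is a 
minimal, self-contained C++ sketch. It is not Impala code: ListBatch, Cursor, 
SlotsIfMaterialising, SlotsIfDelegating and FillScratch are hypothetical names 
made up for this example, and the offsets-plus-items layout is only an assumed 
stand-in for a decoded list column. The resume-by-cursor approach is one simple 
possibility for the case where a collection doesn't fit into the scratch batch; 
it is not necessarily what Parquet does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one decoded list column batch.
// Collection i owns items[offsets[i]] .. items[offsets[i + 1] - 1].
struct ListBatch {
  std::vector<std::size_t> offsets;  // num_collections + 1 entries
  std::vector<int> items;            // flattened int values of all collections
};

// Materialising list reader: one output slot (one CollectionValue) per
// collection, regardless of how many items each collection holds.
std::size_t SlotsIfMaterialising(const ListBatch& batch) {
  assert(!batch.offsets.empty());
  return batch.offsets.size() - 1;
}

// Non-materialising list reader: the child int reader materialises, so one
// output slot is needed per item.
std::size_t SlotsIfDelegating(const ListBatch& batch) {
  return batch.items.size();
}

// Remembers where the previous scratch batch stopped, so that filling can
// resume in the middle of a collection.
struct Cursor {
  std::size_t next_item = 0;
};

// Copies at most `capacity` items into `scratch`, starting at the cursor.
// Returns the number of rows written; 0 means the batch is exhausted.
std::size_t FillScratch(const ListBatch& batch, Cursor* cursor,
                        std::size_t capacity, std::vector<int>* scratch) {
  assert(capacity > 0);
  scratch->clear();
  while (cursor->next_item < batch.items.size() &&
         scratch->size() < capacity) {
    scratch->push_back(batch.items[cursor->next_item++]);
  }
  return scratch->size();
}
```

For three collections holding 3, 0 and 2 ints, SlotsIfMaterialising reports 3 
slots while SlotsIfDelegating reports 5. With a scratch capacity of 2, 
FillScratch has to stop inside the first collection and continue from the 
cursor on the next call.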




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

