Gabor Kaszab created IMPALA-9469:
------------------------------------
Summary: ORC scanner vectorization for collection types
Key: IMPALA-9469
URL: https://issues.apache.org/jira/browse/IMPALA-9469
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Gabor Kaszab
https://issues.apache.org/jira/browse/IMPALA-9228 introduced vectorization for
primitive types and structs. This Jira covers the same for collections (array,
map) and structs containing collections.
*Prerequisites:*
1) As a prerequisite, please check how IMPALA-9228 introduces scratch batches to
hold a batch of rows, and also check how they are populated with primitives or
struct fields.
2) Read the following document to understand the difference between
materialising and non-materialising collection readers:
https://docs.google.com/presentation/d/1uj8m7y69o47MhpqCc0SJ03GDTtPDrg4m04eAFVmq34A
3) Check how Parquet handles collections when populating its scratch batch.
*Implementation details:*
1) Materialising collection readers should be handled similarly to primitive
types. In this case each collection reader writes one slot into the outgoing
RowBatch for each collection it reads. In other words, one collection is
represented as one CollectionValue in the RowBatch.
2) The other case is when the top-level collection reader doesn't materialise
directly into the RowBatch; instead, it delegates the materialisation to its
children. In this case it's not guaranteed that the number of required slots in
the RowBatch will equal the number of collections in the collection reader.
E.g.: Let's assume a table with one column: a list of integers. In this case, if
the top-level ListColumnReader is not materialising, then its child, the
IntColumnReader, will be. But the number of required slots will then be the
number of int values within the collections instead of the number of
collections, as it would be if the ListColumnReader were materialising directly.
As a result, while the scratch batch is being populated we might get into a
situation where a whole collection doesn't fit into the scratch batch. Check how
Parquet handles this case.
3) Once populating the scratch batch is done for collections, it has to be
verified that codegen also runs in these cases. It should work out of the box,
but let's make sure.
4) Currently the ORC scanner chooses between row-by-row processing of the rows
read by the ORC reader and scratch batch reading. Once this Jira is implemented,
the row-by-row approach is no longer needed.
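
To illustrate the slot-count difference from point 2, here is a small toy model in Python. It is not Impala code and does not describe how the Parquet scanner actually solves the problem. The names required_slots and fill_scratch_batches are hypothetical, and splitting a collection across consecutive scratch batches is only an assumed strategy to show why a whole collection may not fit.

```python
from typing import Iterator, List


def required_slots(collections: List[List[int]], materialising: bool) -> int:
    """Count the slots needed for a column of type list<int>.

    A materialising top-level reader writes one slot per collection.
    A non-materialising one delegates to its child, which writes one
    slot per int value.
    """
    if materialising:
        return len(collections)
    return sum(len(c) for c in collections)


def fill_scratch_batches(collections: List[List[int]],
                         capacity: int) -> Iterator[List[int]]:
    """Toy non-materialising case: the child fills fixed-size batches.

    Assumption for illustration only: when a collection does not fit
    into the remaining space, it is continued in the next batch.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    batch: List[int] = []
    for coll in collections:
        for item in coll:
            batch.append(item)
            if len(batch) == capacity:
                yield batch
                batch = []
    if batch:
        yield batch


collections = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
print(required_slots(collections, materialising=True))
print(required_slots(collections, materialising=False))
print(list(fill_scratch_batches(collections, capacity=4)))
```

With a capacity of 4, the third collection (five values) is larger than a whole batch, so it necessarily spans more than one. This is the situation the real implementation has to handle.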