Gabor Kaszab created IMPALA-11363:
-------------------------------------
Summary: Use ReadValueBatch() for the members of Parquet
StructColumnReader
Key: IMPALA-11363
URL: https://issues.apache.org/jira/browse/IMPALA-11363
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 4.1.0
Reporter: Gabor Kaszab
IMPALA-9496 introduced support for querying structs in the select list from
Parquet tables as well. This required adding a new column reader,
StructColumnReader, which has the same interface as the other Parquet column
readers. However, StructColumnReader's ReadValueBatch() calls ReadValue() on
its children instead of ReadValueBatch(), so even when the batched read is
called on the StructColumnReader, it in fact does a non-batched read.
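To make the issue concrete, here is a minimal sketch of the behaviour described
above. ToyChildReader and ToyStructReader are hypothetical stand-ins, not
Impala's actual classes, and the sketch assumes that all children hold the same
number of values. The call counters show that a batched call on the struct
still turns into per-row calls on the children.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy reader for one struct member (not an Impala class).
struct ToyChildReader {
  std::vector<int32_t> values;  // decoded values, one per row
  std::size_t pos = 0;
  int read_value_calls = 0;
  int read_batch_calls = 0;

  // Reads a single value; returns false when the column is exhausted.
  bool ReadValue(int32_t* out) {
    ++read_value_calls;
    if (pos >= values.size()) return false;
    *out = values[pos++];
    return true;
  }

  // Reads up to max_values values in one call; returns the number read.
  std::size_t ReadValueBatch(std::size_t max_values, int32_t* out) {
    ++read_batch_calls;
    std::size_t n = 0;
    while (n < max_values && pos < values.size()) out[n++] = values[pos++];
    return n;
  }
};

// Hypothetical toy struct reader (not an Impala class).
struct ToyStructReader {
  std::vector<ToyChildReader*> children;

  // Batched at the struct level, but each row triggers one ReadValue() call
  // per child, so the children never see a batched read.
  std::size_t ReadValueBatch(std::size_t max_values,
                             std::vector<std::vector<int32_t>>* out) {
    assert(out != nullptr);
    out->assign(children.size(), std::vector<int32_t>());
    std::size_t rows = 0;
    for (; rows < max_values; ++rows) {
      bool ok = true;
      for (std::size_t c = 0; c < children.size(); ++c) {
        int32_t v = 0;
        if (!children[c]->ReadValue(&v)) {
          ok = false;
          break;
        }
        (*out)[c].push_back(v);
      }
      if (!ok) break;
    }
    return rows;
  }
};
```

Reading N rows from a struct with M members costs N * M ReadValue() calls in
this model, and ReadValueBatch() on the children is never invoked.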
The reason is that if the batched read were called on the child readers, there
is currently no way to set the parent struct to NULL when a child reader finds,
based on def_level_, that the struct member is null. It is even more
complicated when there are nested struct columns.
This has an impact on performance, as querying a struct is slower than querying
its children together. As a solution I see two approaches:
1) Enhance the ScalarColumnReaders so that when they do a ReadValueBatch() and
see from def_level_ that there is a NULL value, they also set the parent
structs to NULL, not just themselves. For this, the scalar reader should keep
track of the max def levels of the parent structs and their details in the
internal representation (e.g. tuple offset).
2) Only the first child of the struct is treated as a struct child, while the
others become regular column readers outside the struct. As a result, the
first child would not be read in a batched manner, but the struct's nullness
could be set based on the def_level coming from this child. All the other
members could then be read in a batched manner.
This needs some extra care when there are nested structs. In that case all the
nested structs would be added as children of the current struct, and what I
described above would only apply to the struct(s) at the bottom of the tree.
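A rough sketch of approach 2) for a single, non-nested struct could look like
the following. ToyColumn, ToyStructBatch and ReadStructBatch are hypothetical
names invented for illustration, not Impala code. The sketch assumes that each
column exposes one def level per row and stores only its non-null values, and
it does not attempt the nested-struct case.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy column (not an Impala class).
struct ToyColumn {
  int max_def_level = 0;        // def level at which a value is present
  std::vector<int> def_levels;  // one entry per row
  std::vector<int32_t> values;  // non-null values only
  std::size_t row = 0;
  std::size_t value_idx = 0;

  // Per-row read that also exposes the raw def level to the caller.
  int32_t ReadValue(bool* is_null, int* def_level) {
    assert(row < def_levels.size());
    *def_level = def_levels[row++];
    *is_null = *def_level < max_def_level;
    return *is_null ? 0 : values[value_idx++];
  }

  // Batched read of n rows; knows nothing about the parent struct.
  void ReadValueBatch(std::size_t n, std::vector<bool>* is_null,
                      std::vector<int32_t>* out) {
    assert(row + n <= def_levels.size());
    for (std::size_t i = 0; i < n; ++i) {
      const bool null = def_levels[row++] < max_def_level;
      is_null->push_back(null);
      out->push_back(null ? 0 : values[value_idx++]);
    }
  }
};

// Hypothetical result layout: null flags for the struct and for each member.
struct ToyStructBatch {
  std::vector<bool> struct_is_null;
  std::vector<std::vector<bool>> member_is_null;
  std::vector<std::vector<int32_t>> member_values;
};

// The first member is read row by row and decides the struct's nullness;
// the remaining members are read with one batched call each.
ToyStructBatch ReadStructBatch(int struct_max_def_level, std::size_t n,
                               ToyColumn* first,
                               const std::vector<ToyColumn*>& others) {
  ToyStructBatch batch;
  batch.member_is_null.resize(1 + others.size());
  batch.member_values.resize(1 + others.size());
  for (std::size_t i = 0; i < n; ++i) {
    bool is_null = false;
    int def_level = 0;
    const int32_t v = first->ReadValue(&is_null, &def_level);
    batch.struct_is_null.push_back(def_level < struct_max_def_level);
    batch.member_is_null[0].push_back(is_null);
    batch.member_values[0].push_back(v);
  }
  for (std::size_t c = 0; c < others.size(); ++c) {
    others[c]->ReadValueBatch(n, &batch.member_is_null[c + 1],
                              &batch.member_values[c + 1]);
  }
  return batch;
}
```

In this model only one member pays the per-row cost, at the price of treating
the first member differently from the rest.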
I personally would go for 1) as it is more straightforward and easier to
understand.
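As an illustration of approach 1), here is a minimal sketch of a scalar reader
that knows the max def levels of its enclosing structs and sets their null
flags during a batched read. ParentStructInfo and ToyScalarReader are
hypothetical names, not Impala types, and plain per-row null vectors stand in
for the real internal representation (tuple offsets, null indicators). The
sketch assumes optional structs and members with no repeated fields, so a def
level below a struct's max def level means that struct is NULL for the row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one enclosing struct (not an Impala type).
struct ParentStructInfo {
  int max_def_level;           // def level at which this struct is non-null
  std::vector<bool>* is_null;  // per-row null flags of the struct
};

// Hypothetical scalar reader that is aware of its parent structs.
struct ToyScalarReader {
  int max_def_level;                      // def level at which a value exists
  std::vector<ParentStructInfo> parents;  // enclosing structs, outermost first

  // Batched read: one def level per row, values holds non-null entries only.
  // Output vectors must already be sized to def_levels.size().
  void ReadValueBatch(const std::vector<int>& def_levels,
                      const std::vector<int32_t>& values,
                      std::vector<bool>* member_is_null,
                      std::vector<int32_t>* out) {
    assert(member_is_null->size() == def_levels.size());
    assert(out->size() == def_levels.size());
    std::size_t value_idx = 0;
    for (std::size_t row = 0; row < def_levels.size(); ++row) {
      const int def_level = def_levels[row];
      // Propagate NULL to every enclosing struct the def level does not reach.
      for (const ParentStructInfo& p : parents) {
        if (def_level < p.max_def_level) (*p.is_null)[row] = true;
      }
      const bool present = def_level >= max_def_level;
      (*member_is_null)[row] = !present;
      if (present) assert(value_idx < values.size());
      (*out)[row] = present ? values[value_idx++] : 0;
    }
  }
};
```

For example, with an optional outer struct (max def level 1), an optional inner
struct (2) and an optional member (3), a def level of 1 marks the inner struct
and the member as NULL but leaves the outer struct non-null. If a struct has
several members, each member's reader would set the same parent flags
redundantly in this model.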