[jira] [Created] (IMPALA-11363) Use ReadValueBatch() when the members of Parquet StructColumnReader

Gabor Kaszab (Jira) Thu, 16 Jun 2022 02:31:11 -0700

Gabor Kaszab created IMPALA-11363:
-------------------------------------

             Summary: Use ReadValueBatch() when the members of Parquet 
StructColumnReader
                 Key: IMPALA-11363
                 URL: https://issues.apache.org/jira/browse/IMPALA-11363
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 4.1.0
            Reporter: Gabor Kaszab



IMPALA-9496 added support for querying structs in the select list from Parquet 
tables as well. This required a new column reader, StructColumnReader, which 
exposes the same interface as the other Parquet column readers. However, 
StructColumnReader's ReadValueBatch() calls ReadValue() on its children instead 
of ReadValueBatch(), so even when the batched read is invoked on the 
StructColumnReader, the children are in fact read one value at a time.
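
To make the difference concrete, here is a minimal, self-contained sketch of 
the call pattern. ToyChildReader and ToyStructReader are hypothetical stand-ins 
invented for illustration, not Impala's actual classes, and the row-major int 
output buffer is an assumption as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified stand-in for a Parquet column reader.
struct ToyChildReader {
  std::vector<int> values;   // already-decoded values of this column
  std::size_t pos = 0;       // index of the next value to hand out
  int read_value_calls = 0;  // how often the per-row entry point was used
  int read_batch_calls = 0;  // how often the batched entry point was used

  // Per-row read: one call per value.
  bool ReadValue(int* out) {
    ++read_value_calls;
    if (pos >= values.size()) return false;
    *out = values[pos++];
    return true;
  }

  // Batched read: one call for up to max_values values.
  std::size_t ReadValueBatch(std::size_t max_values, int* out) {
    ++read_batch_calls;
    std::size_t n = 0;
    while (n < max_values && pos < values.size()) out[n++] = values[pos++];
    return n;
  }
};

// Hypothetical struct reader that mirrors the behaviour described above:
// its batched entry point falls back to per-row reads on every child.
struct ToyStructReader {
  std::vector<ToyChildReader*> children;

  // out is row-major: out[row * children.size() + child].
  // Returns the number of complete rows that were read.
  std::size_t ReadValueBatch(std::size_t max_values, int* out) {
    assert(out != nullptr);
    const std::size_t width = children.size();
    std::size_t row = 0;
    for (; row < max_values; ++row) {
      for (std::size_t c = 0; c < width; ++c) {
        if (!children[c]->ReadValue(&out[row * width + c])) return row;
      }
    }
    return row;
  }
};
```

Reading three rows of a two-member struct through ToyStructReader results in 
three ReadValue() calls per child and no ReadValueBatch() call on any child, 
which is the per-value overhead this ticket is about.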

The reason is that if the batched read were called on the child readers, there 
is currently no way to set the parent struct to NULL when a child reader finds 
that def_level_ indicates a NULL for the struct member. It gets even more 
complicated when there is a nested struct column.

This hurts performance: querying a struct is slower than querying all of its 
members directly. I see two possible approaches:

1) Enhance the ScalarColumnReaders so that when ReadValueBatch() sees, based on 
def_level_, that a value is NULL, it sets not only its own slot but also the 
parent structs to NULL. For this, the scalar reader would have to keep track of 
the max def levels of its parent structs and of their details in the internal 
representation (e.g. tuple offset).
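
A rough sketch of what 1) could look like, under stated assumptions: every 
struct and member is optional, there is no repetition, and a definition level 
below a struct's max def level means that struct is NULL (the usual Parquet 
nested encoding). AncestorInfo, ToyBatch and ToyScalarReader are hypothetical 
names made up for this example; a plain per-column NULL flag vector stands in 
for the tuple offsets mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one enclosing struct of a scalar member.
struct AncestorInfo {
  int max_def_level;        // def level from which this struct is non-NULL
  std::size_t null_column;  // index of this struct's flags in ToyBatch::nulls
};

// Toy output batch: nulls[column][row] is true when that slot is NULL.
struct ToyBatch {
  std::vector<std::vector<bool>> nulls;
  std::vector<std::int32_t> values;  // the scalar member's value per row
};

// Hypothetical scalar reader that knows about its enclosing structs.
struct ToyScalarReader {
  int max_def_level;                    // def level at which a value exists
  std::size_t null_column;              // this member's flags in ToyBatch
  std::vector<AncestorInfo> ancestors;  // enclosing structs, outermost first

  // def_levels holds one entry per row; encoded_values holds only the
  // non-NULL values, in row order.
  void ReadValueBatch(const std::vector<int>& def_levels,
                      const std::vector<std::int32_t>& encoded_values,
                      ToyBatch* batch) const {
    std::size_t next_value = 0;
    for (std::size_t row = 0; row < def_levels.size(); ++row) {
      const int def_level = def_levels[row];
      // Mark every enclosing struct that is not defined at this def level.
      for (const AncestorInfo& a : ancestors) {
        if (def_level < a.max_def_level) batch->nulls[a.null_column][row] = true;
      }
      if (def_level < max_def_level) {
        batch->nulls[null_column][row] = true;  // the member itself is NULL
        continue;
      }
      assert(next_value < encoded_values.size());
      batch->values[row] = encoded_values[next_value++];
    }
  }
};
```

With an outer struct (max def level 1), an inner struct (max def level 2) and 
a scalar member (max def level 3), the def levels 3, 2, 1, 0 would mark, in 
turn: nothing, only the member, the member and the inner struct, and all three 
slots as NULL. Nested structs need no special handling in this sketch beyond a 
longer ancestor list.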

2) Treat only the first child of the struct as a struct child, and make the 
others regular column readers outside the struct. The first child would then 
not be read in a batched manner, but the struct's NULL-ness could be set based 
on the def_level coming from this child. All the other members could then be 
read in a batched manner.
This needs some extra care when there are nested structs. In that case all the 
nested structs would be added as children of the current struct, and what I 
described above would apply only to the struct(s) at the bottom of the tree.
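
For a single, non-nested struct, 2) could be sketched as follows. This is only 
an illustration under assumptions: ToyMemberReader and ReadStruct are 
hypothetical names, only NULL flags are modelled (no values), and a definition 
level below the struct's max def level is taken to mean that the struct is 
NULL.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the column reader of one struct member.
struct ToyMemberReader {
  int max_def_level;            // def level at which the member holds a value
  std::vector<int> def_levels;  // one definition level per row
  std::vector<bool> is_null;    // output: the member's NULL flag per row
  int read_value_calls = 0;
  int read_batch_calls = 0;

  ToyMemberReader(int max_def, std::vector<int> levels)
      : max_def_level(max_def),
        def_levels(std::move(levels)),
        is_null(def_levels.size(), false) {}

  // Per-row read; returns the row's def level so the caller can inspect it.
  int ReadValue(std::size_t row) {
    ++read_value_calls;
    is_null[row] = def_levels[row] < max_def_level;
    return def_levels[row];
  }

  // Batched read of all rows in a single call.
  void ReadValueBatch() {
    ++read_batch_calls;
    for (std::size_t row = 0; row < def_levels.size(); ++row) {
      is_null[row] = def_levels[row] < max_def_level;
    }
  }
};

// The first member is read row by row and decides the struct's NULL flag;
// every other member is read with one batched call.
std::vector<bool> ReadStruct(int struct_max_def_level, ToyMemberReader* first,
                             const std::vector<ToyMemberReader*>& others) {
  const std::size_t num_rows = first->def_levels.size();
  std::vector<bool> struct_is_null(num_rows, false);
  for (std::size_t row = 0; row < num_rows; ++row) {
    struct_is_null[row] = first->ReadValue(row) < struct_max_def_level;
  }
  for (ToyMemberReader* member : others) {
    assert(member->def_levels.size() == num_rows);
    member->ReadValueBatch();
  }
  return struct_is_null;
}
```

In this sketch only the first member pays the per-row cost. The nested case 
described above, where inner structs stay children of the outer struct, is not 
covered.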

I would personally go for 1), as it is more straightforward and easier to 
understand.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
