[jira] [Commented] (IMPALA-11363) Use ReadValueBatch() when the members of Parquet StructColumnReader

Daniel Becker (Jira) Thu, 16 Jun 2022 02:48:07 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554998#comment-17554998
 ]


Daniel Becker commented on IMPALA-11363:
----------------------------------------

I agree that solution 1) seems to be cleaner and easier to understand.

> Use ReadValueBatch() when the members of Parquet StructColumnReader
> -------------------------------------------------------------------
>
>                 Key: IMPALA-11363
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11363
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.1.0
>            Reporter: Gabor Kaszab
>            Priority: Major
>              Labels: complextype
>
> IMPALA-9496 introduced support for querying structs in the select list 
> from Parquet tables as well. This required adding a new column reader, 
> StructColumnReader, which has the same interface as all the other Parquet 
> column readers. However, the ReadValueBatch() of the StructColumnReader 
> calls the ReadValue() of its children instead of ReadValueBatch(), so even 
> though the batched read is called on the StructColumnReader, it in fact 
> does a non-batched read.
> The reason for this is that if the batched read were called on the child 
> readers, there would currently be no way to set the parent struct to NULL 
> when a child reader finds that the def_level_ indicates that the struct 
> member is null. It's even more complicated when there is a nested struct 
> column.
> This has an impact on performance as querying a struct is slower than 
> querying its children together. As a solution I see 2 approaches:
> 1) Enhance the ScalarColumnReaders so that when they do a ReadValueBatch() 
> and see, based on def_level_, that there is a NULL value, they also set the 
> parent structs to NULL, not just themselves. For this, the scalar reader 
> should keep track of the max def levels of the parent structs and their 
> details in the internal representation (e.g. tuple offset).
> 2) Only the first child of the struct is used as a struct child while the 
> others could be regular column readers not inside the struct. As a result the 
> first child wouldn't be read in a batched manner but then the struct could be 
> set based on the def_level coming from this child. All the other members 
> could be then read in a batched manner.
> This needs some extra care when there are nested structs. In this case all 
> the structs would be added as children to the current struct, and what I 
> described above would only apply to the struct(s) at the bottom of the tree.
> I personally would go for 1) as it is more straightforward and easier to 
> understand.
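
For illustration, approach 1) could be sketched roughly as follows. This is a
minimal standalone sketch, not Impala code: ScalarReaderSketch,
ParentStructInfo, the one-byte null indicators, the int32 value slot and the
flat tuple layout are all assumptions made for the example, and Impala's
actual readers and tuple layout differ. It only shows the idea of a scalar
reader that knows the def levels of its enclosing structs and marks them NULL
itself during a batched read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of one enclosing struct of a scalar column.
struct ParentStructInfo {
  int def_level;         // Lowest def level at which this struct is non-NULL.
  int null_byte_offset;  // Offset of the struct's null byte within a tuple.
};

// Hypothetical, simplified stand-in for a scalar column reader.
struct ScalarReaderSketch {
  int max_def_level;     // Def level meaning "the value itself is present".
  int value_offset;      // Offset of the int32 value slot within a tuple.
  int null_byte_offset;  // Offset of this slot's null byte within a tuple.
  std::vector<ParentStructInfo> parents;  // Enclosing structs, outermost first.

  // Batched read: consumes one def level per row and writes into fixed-size
  // tuples stored back to back in 'tuples'. 'values' holds only non-NULL values.
  void ReadValueBatch(const std::vector<int>& def_levels,
                      const std::vector<int32_t>& values,
                      std::vector<uint8_t>& tuples, int tuple_size) const {
    assert(tuples.size() >= def_levels.size() * static_cast<size_t>(tuple_size));
    size_t next_value = 0;
    for (size_t row = 0; row < def_levels.size(); ++row) {
      uint8_t* tuple = tuples.data() + row * static_cast<size_t>(tuple_size);
      const int def_level = def_levels[row];
      // The scalar reader also marks every enclosing struct whose def level
      // was not reached as NULL, so no per-value call from the parent is needed.
      for (const ParentStructInfo& parent : parents) {
        if (def_level < parent.def_level) tuple[parent.null_byte_offset] = 1;
      }
      if (def_level < max_def_level) {
        tuple[null_byte_offset] = 1;  // The member itself is NULL.
      } else {
        assert(next_value < values.size());
        std::memcpy(tuple + value_offset, &values[next_value], sizeof(int32_t));
        ++next_value;
      }
    }
  }
};
```

With an optional struct containing one optional int member (struct def level
1, member max def level 2), def levels {2, 1, 0} would yield a present value,
a NULL member inside a non-NULL struct, and a NULL struct, respectively.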



