[
https://issues.apache.org/jira/browse/IMPALA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554998#comment-17554998
]
Daniel Becker commented on IMPALA-11363:
----------------------------------------
I agree that solution 1) seems to be cleaner and easier to understand.
> Use ReadValueBatch() when the members of Parquet StructColumnReader
> -------------------------------------------------------------------
>
> Key: IMPALA-11363
> URL: https://issues.apache.org/jira/browse/IMPALA-11363
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.1.0
> Reporter: Gabor Kaszab
> Priority: Major
> Labels: complextype
>
> IMPALA-9496 added support for querying structs in the select list from
> Parquet tables as well. This required a new column reader,
> StructColumnReader, which has the same interface as the other Parquet column
> readers. However, the ReadValueBatch() of the StructColumnReader calls the
> ReadValue() of its children instead of ReadValueBatch(), so even when the
> batched read is called on the StructColumnReader, it in fact does a
> non-batched read.
> The reason for this is that if the batched read were called on the child
> readers, there would currently be no way to set the parent struct to NULL
> when a child reader finds that the def_level_ indicates that the struct
> member is null. It's even more complicated when there is a nested struct
> column.
> This has an impact on performance as querying a struct is slower than
> querying its children together. As a solution I see 2 approaches:
> 1) Enhance the ScalarColumnReaders so that when they do a ReadValueBatch()
> and see, based on def_level_, that there is a NULL value, they also set the
> parent structs to NULL, not just themselves. For this, the scalar reader
> should keep track of the max def levels of the parent structs and their
> details in the internal representation (e.g. tuple offset).
> 2) Only the first child of the struct is used as a struct child, while the
> others could be regular column readers not inside the struct. As a result,
> the first child wouldn't be read in a batched manner, but the struct's
> nullness could then be set based on the def_level coming from this child.
> All the other members could then be read in a batched manner.
> This needs some extra care when there are nested structs. In this case all
> the nested structs would be added as children to the current struct, and
> what I described above would only apply to the struct(s) at the bottom of
> the tree.
> I personally would go for 1) as it is more straightforward and easier to
> understand.
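For illustration only, approach 1) could be sketched roughly as below. This is not Impala code: ScalarReaderSketch, ParentStructInfo, Row and the null_slot indices are hypothetical stand-ins for the real reader classes and the tuple null-indicator offsets. The sketch assumes there are no repeated fields (so only definition levels matter) and that a def level below a struct's own max def level means that struct is NULL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical description of one enclosing struct of a scalar column.
struct ParentStructInfo {
  int max_def_level;      // def level at which this struct itself is non-NULL
  std::size_t null_slot;  // stand-in for the struct's null-indicator offset
};

// Hypothetical stand-in for a tuple: one null flag per slot plus one value.
struct Row {
  std::vector<bool> is_null;
  int32_t value = 0;
};

class ScalarReaderSketch {
 public:
  ScalarReaderSketch(int max_def_level, std::size_t null_slot,
                     std::size_t num_slots,
                     std::vector<ParentStructInfo> parents)
      : max_def_level_(max_def_level),
        null_slot_(null_slot),
        num_slots_(num_slots),
        parents_(std::move(parents)) {}

  // Batched read: consumes one def level per row and appends one Row each.
  // 'values' holds only the non-NULL values, in order.
  void ReadValueBatch(const std::vector<int>& def_levels,
                      const std::vector<int32_t>& values,
                      std::vector<Row>* rows) const {
    std::size_t next_value = 0;
    for (int def_level : def_levels) {
      Row row;
      row.is_null.assign(num_slots_, false);
      // Every enclosing struct whose max def level was not reached is NULL.
      for (const ParentStructInfo& parent : parents_) {
        if (def_level < parent.max_def_level) {
          row.is_null[parent.null_slot] = true;
        }
      }
      if (def_level < max_def_level_) {
        row.is_null[null_slot_] = true;  // the scalar itself is NULL
      } else {
        assert(next_value < values.size());
        row.value = values[next_value++];
      }
      rows->push_back(row);
    }
  }

 private:
  int max_def_level_;
  std::size_t null_slot_;
  std::size_t num_slots_;
  std::vector<ParentStructInfo> parents_;
};
```

With an optional outer struct (max def level 1), an optional inner struct (max def level 2) and an optional scalar member (max def level 3), a def level of 1 would mark the inner struct and the scalar as NULL while leaving the outer struct non-NULL, all within a single batched call on the scalar reader.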