[ https://issues.apache.org/jira/browse/IMPALA-9873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388833#comment-17388833 ]
Amogh Margoor commented on IMPALA-9873:
---------------------------------------
https://docs.google.com/document/d/1QFu_Zu9nHuMpu5Pqb3qe62MbZPA88j_o7NtpZ2a2zSA/edit?usp=sharing.
> Skip decoding of non-materialised columns in Parquet
> ----------------------------------------------------
>
> Key: IMPALA-9873
> URL: https://issues.apache.org/jira/browse/IMPALA-9873
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Tim Armstrong
> Assignee: Amogh Margoor
> Priority: Major
>
> This is a first milestone for lazy materialization in Parquet, focusing on
> avoiding decompression and decoding of columns.
> * Identify columns referenced by predicates and runtime row filters and
> determine what order the columns need to be materialised in. We probably want
> to evaluate static predicates before runtime filters to match current
> behaviour.
> * Rework this loop so that it alternates between materialising columns and
> evaluating predicates:
> https://github.com/apache/impala/blob/052129c/be/src/exec/parquet/hdfs-parquet-scanner.cc#L1110
> * We probably need to keep track of filtered rows using a new data structure,
> e.g. a bitmap.
> * We then need to check that bitmap at each step to see whether we can skip
> materialising part or all of the following columns. E.g. if the first N rows
> were pruned, we can skip the remaining readers forward by N rows.
> * This part may be a little tricky - there is the risk of adding overhead
> compared to the current code.
> * It is probably OK to just materialise the partition columns to start off
> with - avoiding materialising those is not going to buy that much.
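
The steps in the description above can be sketched in a few lines. This is a minimal Python illustration, not Impala's C++ scanner code: `ColumnReader`, `runs`, `materialise`, and `scan_batch` are hypothetical names, the selection bitmap is a plain list of booleans, and "decoding" is simulated by copying values. It assumes the predicate list is already ordered with static predicates first and runtime filters after them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple


@dataclass
class ColumnReader:
    """Hypothetical stand-in for a Parquet column reader."""
    name: str
    encoded: Sequence[int]      # pretend these values are still encoded
    pos: int = 0                # current row position in the column
    decoded_count: int = 0      # how many values were actually decoded

    def skip(self, n: int) -> None:
        # Advance past n rows without decoding them.
        self.pos += n

    def read(self, n: int) -> List[int]:
        # "Decode" the next n rows.
        out = list(self.encoded[self.pos:self.pos + n])
        self.pos += n
        self.decoded_count += n
        return out


def runs(selected: List[bool]) -> Iterator[Tuple[bool, int]]:
    """Yield (is_selected, length) for each run in the selection bitmap."""
    i = 0
    while i < len(selected):
        j = i
        while j < len(selected) and selected[j] == selected[i]:
            j += 1
        yield selected[i], j - i
        i = j


def materialise(reader: ColumnReader, selected: List[bool]) -> List[Optional[int]]:
    """Decode only rows that are still selected; pruned rows become None."""
    values: List[Optional[int]] = []
    for keep, length in runs(selected):
        if keep:
            values.extend(reader.read(length))
        else:
            reader.skip(length)              # e.g. first N rows pruned: skip N
            values.extend([None] * length)
    return values


def scan_batch(
    readers: Dict[str, ColumnReader],
    predicates: List[Tuple[str, Callable[[int], bool]]],
    num_rows: int,
) -> List[Dict[str, int]]:
    """Alternate between materialising a column and evaluating its predicate."""
    selected = [True] * num_rows
    columns: Dict[str, List[Optional[int]]] = {}
    for col, pred in predicates:
        if col not in columns:
            columns[col] = materialise(readers[col], selected)
        # Short-circuit keeps pred() away from rows that were already pruned.
        selected = [s and pred(v) for s, v in zip(selected, columns[col])]
    # Columns without predicates are materialised last, for surviving rows only.
    for name, reader in readers.items():
        if name not in columns:
            columns[name] = materialise(reader, selected)
    return [
        {name: columns[name][i] for name in readers}
        for i in range(num_rows)
        if selected[i]
    ]
```

With three six-row columns and predicates on the first two, the first column is decoded in full, while the later readers decode only the rows that are still selected and skip the rest. The run-length walk over the bitmap is one possible way to limit the per-row overhead that the description warns about; other encodings of the filtered-row set would work equally well in this sketch.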