[jira] [Updated] (IMPALA-9747) More fine-grained codegen for text file scanners

Csaba Ringhofer (Jira) Thu, 14 May 2020 12:32:29 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Csaba Ringhofer updated IMPALA-9747:
------------------------------------
    Description: 
Currently, if the materialization of any column cannot be codegen'd for some 
reason (e.g. it is CHAR(N)), then codegen is cancelled for the whole text 
scanner, see:
https://github.com/apache/impala/blob/b5805de3e65fd1c7154e4169b323bb38ddc54f4f/be/src/exec/text-converter.cc#L112
https://github.com/apache/impala/blob/58273fff601dcc763ac43f7cc275a174a2e18b6b/be/src/exec/hdfs-scanner.cc#L342

It would be much better to use the non-codegen'd path only for the problematic 
columns, use the codegen'd materialization for the rest, and always do 
conjunct evaluation with codegen.

The codegen'd path orders slots based on the conjuncts that use them and 
evaluates each conjunct as soon as the slots it needs become available, so if 
the row is dropped then the rest of the slots do not need to be materialized. 
A simple solution would be to always do non-codegen'd slot materialization 
first so that those slots are ready if a conjunct needs them. Moving the 
columns that are not used by conjuncts to the end could be a further 
optimization.
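
As a rough illustration of the proposed ordering (not Impala's actual code), 
the sketch below sorts slots into three groups: slots that cannot be 
codegen'd first, then codegen'd slots ordered by the first conjunct that 
reads them, then codegen'd slots that no conjunct uses. SlotInfo, Rank and 
OrderSlots are hypothetical names made up for this example, and tracking 
only the first conjunct per slot is a simplifying assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a slot descriptor; not Impala's SlotDescriptor.
struct SlotInfo {
  std::string name;
  bool codegen_supported;  // false for types such as CHAR(N)
  int first_conjunct;      // index of the first conjunct reading the slot, -1 if none
};

// 0: must use the non-codegen'd path, 1: codegen'd and read by a conjunct,
// 2: codegen'd and not read by any conjunct.
int Rank(const SlotInfo& s) {
  if (!s.codegen_supported) return 0;
  return s.first_conjunct >= 0 ? 1 : 2;
}

// Returns the slots in the order they would be materialized.
std::vector<SlotInfo> OrderSlots(std::vector<SlotInfo> slots) {
  std::stable_sort(slots.begin(), slots.end(),
      [](const SlotInfo& a, const SlotInfo& b) {
        const int ra = Rank(a);
        const int rb = Rank(b);
        if (ra != rb) return ra < rb;
        // Within the conjunct group, earlier conjuncts come first so a row
        // can be dropped before later slots are materialized.
        if (ra == 1) return a.first_conjunct < b.first_conjunct;
        return false;  // keep the original relative order otherwise
      });
  return slots;
}
```

With this ordering, a CHAR(N) slot is always materialized before any 
conjunct runs, and slots unused by conjuncts are only materialized for rows 
that survive all conjuncts.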

This came up during the materialization of BINARY columns, which need base64 
decoding during materialization.

> More fine-grained codegen for text file scanners
> ------------------------------------------------
>
>                 Key: IMPALA-9747
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9747
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

