[ 
https://issues.apache.org/jira/browse/IMPALA-8077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742213#comment-16742213
 ] 

Csaba Ringhofer commented on IMPALA-8077:
-----------------------------------------

I see two ways to implement this optimization:
1. Skip conversion during scanning (for columns that are not used in the 
predicate), and do it in a later step, for example at the end of predicate 
calculation.
2. Move toward lazy materialization, and decode columns that are not used in 
the predicate in a second pass.
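
To make option 1 concrete, here is a minimal Python sketch (not Impala code). 
The names scan_deferred and utc_to_local and the fixed +01:00 offset are 
hypothetical stand-ins chosen for illustration only. The predicate runs on 
the raw rows, and the timestamp conversion is applied only to rows that pass.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for the UTC -> local conversion; a fixed +01:00
# offset is assumed here purely for illustration.
LOCAL_TZ = timezone(timedelta(hours=1))
conversions = 0  # counts how many values were actually converted


def utc_to_local(ts):
    """Convert a naive UTC datetime to a naive local datetime."""
    global conversions
    conversions += 1
    aware = ts.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)
    return aware.replace(tzinfo=None)


def scan_deferred(ids, raw_ts, predicate):
    """Evaluate the predicate first, then convert only the surviving rows."""
    out = []
    for row_id, ts in zip(ids, raw_ts):
        if predicate(row_id):  # 'ts' is not referenced by the predicate
            out.append((row_id, utc_to_local(ts)))
    return out


ids = [1, 2, 3, 4]
raw_ts = [datetime(2019, 1, 14, 15, 0, 0)] * 4
rows = scan_deferred(ids, raw_ts, lambda i: i == 1)
```

With the predicate id = 1, only one of the four timestamps is converted; the 
other three values are never passed to the conversion function.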

I prefer option 2, because lazy materialization could speed up many unrelated 
queries. IMPALA-5843 includes implementing row skipping for Parquet data page 
decoders, which could be used to radically speed up very selective queries.
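
A rough Python sketch of the two-pass idea in option 2 follows (again not 
Impala code). The helpers decode and scan_two_pass are hypothetical, and the 
byte strings stand in for encoded column values. It assumes the decoder can 
jump to arbitrary row indexes, which is roughly what the row skipping 
mentioned above would have to provide.

```python
decoded_values = 0  # counts how many values were actually decoded


def decode(encoded, rows=None):
    """Decode all values of a column, or only the given row indexes."""
    global decoded_values
    indexes = range(len(encoded)) if rows is None else rows
    result = {}
    for i in indexes:
        decoded_values += 1
        result[i] = encoded[i].decode("utf-8")  # stand-in for real decoding
    return result


def scan_two_pass(columns, predicate_col, predicate):
    # Pass 1: decode only the predicate column and collect matching rows.
    pred_values = decode(columns[predicate_col])
    selected = [i for i, v in pred_values.items() if predicate(v)]
    # Pass 2: decode the remaining columns for the selected rows only.
    others = {
        name: decode(col, selected)
        for name, col in columns.items()
        if name != predicate_col
    }
    return [
        {predicate_col: pred_values[i],
         **{name: vals[i] for name, vals in others.items()}}
        for i in selected
    ]


columns = {
    "id": [b"1", b"2", b"3"],
    "ts": [b"2019-01-14 15:00:00", b"2019-01-14 16:00:00",
           b"2019-01-14 17:00:00"],
}
rows = scan_two_pass(columns, "id", lambda v: v == "1")
```

Here all three id values are decoded, but only one ts value is, so the 
saving grows with the selectivity of the predicate.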

If we want to implement this before IMPALA-5843 is finished, then I would go 
for solution 1.

> Avoid converting timestamps in dropped rows during Parquet scanning
> -------------------------------------------------------------------
>
>                 Key: IMPALA-8077
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8077
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: parquet, performance, timestamp
>
> If flag convert_legacy_hive_parquet_utc_timestamps is true, then every 
> TIMESTAMP value is converted from UTC to local time during Parquet scanning. 
> This is done during column decoding, and Impala materializes every column 
> before calculating the WHERE predicate, so if a timestamp column is not in 
> the predicate, then the conversion is unnecessarily done in rows that fail 
> the predicate.
> Example:
> CREATE TABLE t (id INT, ts TIMESTAMP) STORED AS PARQUET;
> SELECT * FROM t WHERE id = 1;
> Timezone conversion will be done for every 'ts', even if the predicate 
> matches only a single row (let's ignore stats and dictionary filtering). The 
> CPU time of the query above is likely to be dominated by timezone conversion, 
> especially if the query is very selective.
> Note that the same overhead is "normal" if the predicate uses the timestamp 
> column, e.g. in
> SELECT * FROM t WHERE ts = "2019.01.14 16:00:00"
> It would be possible to avoid this conversion, but this would be very hacky, 
> so this is out of the scope of this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

