Re: Discussion about IMPALA-10789: Early materialize expressions in ScanNode

Tim Armstrong Mon, 19 Jul 2021 21:14:40 -0700

I believe that this optimisation could be very beneficial. It is difficult
to cost - late materialisation can be a lot cheaper if there are predicates.

Pruning columns would be very useful for some queries just on its own. We
tried to implement a related optimisation -
https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some good
performance benefits, but the implementation in the planner was quite
complex because it tried to insert UNION operators at many points in the
plan.

We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902 would
be the long-term solution to getting more parallelism so that it doesn't
matter whether expression evaluation is in the scan or not, but probably it
makes sense to get the community's opinion on that.

On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote:

> I have submitted the following issue,
> https://issues.apache.org/jira/browse/IMPALA-10789 <
> https://issues.apache.org/jira/browse/IMPALA-10789>
>
> In some cases, the performance is better that early materialize
> expressions in ScanNode.
> For example,
> SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM ( SELECT
> CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col FROM
> functional_parquet.alltypesagg ) t
> The expression only needs to be evaluated once if materialize expressions
> in ScanNode.
> I have roughly implemented this feature and the performance has improved
> significantly.
> I would like to discuss whether this feature can be contributed.
>
> Thanks!
> Xianqing

Re: Discussion about IMPALA-10789: Early materialize expressions in ScanNode

Reply via email to