I believe that this optimisation could be very beneficial. It is difficult to cost, though: late materialisation can be a lot cheaper if there are predicates, because the expression is then evaluated only for the rows that pass them.
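To make the costing trade-off concrete, here is a minimal sketch that counts expression evaluations under both strategies. It is a toy model, not Impala's planner or scanner: `CountingExpr`, `early_calls`, `late_calls` and the row layout are hypothetical names made up for illustration. It assumes, following the quoted message, that without early materialisation the expression is re-evaluated once per use site (four aggregates in the quoted query).

```python
import re


def extract_number(s):
    # Stand-in for CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint).
    m = re.search(r"\d+", s)
    return int(m.group(0)) if m else None


class CountingExpr:
    """Wraps a function and counts how many times it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        return self.fn(value)


def early_calls(rows, predicate, num_uses):
    # Early: the scan evaluates the expression once for every row it reads,
    # before the predicate runs. Later use sites reuse the stored value,
    # so num_uses does not affect the count (kept for a symmetric signature).
    expr = CountingExpr(extract_number)
    materialised = [(row, expr(row["string_col"])) for row in rows]
    _ = [value for row, value in materialised if predicate(row)]
    return expr.calls


def late_calls(rows, predicate, num_uses):
    # Late: the predicate runs first, then each use site (an assumption of
    # this model) re-evaluates the expression for every surviving row.
    expr = CountingExpr(extract_number)
    for row in rows:
        if predicate(row):
            for _ in range(num_uses):
                expr(row["string_col"])
    return expr.calls


rows = [{"id": i, "string_col": f"row{i}"} for i in range(1000)]
NUM_USES = 4  # SUM, COUNT, MIN, MAX in the quoted query


def no_filter(row):
    return True


def selective(row):
    # Hypothetical predicate that keeps 1 row in 10.
    return row["id"] % 10 == 0


for name, pred in [("no predicate", no_filter), ("1-in-10 predicate", selective)]:
    print(name, early_calls(rows, pred, NUM_USES), late_calls(rows, pred, NUM_USES))
```

In this model, early materialisation wins when nothing is filtered (1000 evaluations against 4000), and late materialisation wins once the predicate discards most rows (400 against 1000). A real cost model would also need selectivity estimates and the relative cost of the expression, which this sketch ignores.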
Pruning columns would be very useful for some queries just on its own.

We tried to implement a related optimisation - https://issues.apache.org/jira/browse/IMPALA-2138 - and it showed some good performance benefits, but the implementation in the planner was quite complex because it tried to insert UNION operators at many points in the plan.

We were hoping that https://issues.apache.org/jira/browse/IMPALA-3902 would be the long-term solution to getting more parallelism, so that it doesn't matter whether expression evaluation is in the scan or not, but it probably makes sense to get the community's opinion on that.

On Mon, 19 Jul 2021 at 20:35, hexianqing <hexianqing...@126.com> wrote:

> I have submitted the following issue:
> https://issues.apache.org/jira/browse/IMPALA-10789
>
> In some cases, performance is better when expressions are materialized
> early in the ScanNode.
> For example:
>
>   SELECT SUM(col), COUNT(col), MIN(col), MAX(col) FROM (
>     SELECT CAST(regexp_extract(string_col, '(\\d+)', 0) AS bigint) col
>     FROM functional_parquet.alltypesagg
>   ) t
>
> The expression only needs to be evaluated once if it is materialized
> in the ScanNode.
> I have roughly implemented this feature and the performance has improved
> significantly.
> I would like to discuss whether this feature can be contributed.
>
> Thanks!
> Xianqing