Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19082
@maropu The code that performs the aggregation is actually wrapped in a
function, `doAggregateWithKeys` or `doAggregateWithoutKey`. This is also the
part of the generated code that this PR improves by extracting functions.
My initial thought is that, while a query is processed,
`doAggregateWithKeys`/`doAggregateWithoutKey` runs only once to aggregate all
rows. Whether or not it is a long function, the JIT has no chance to step in.
In other words, the length of this function does not matter much for the JIT
issue.
The long function issue affects the performance of whole-stage codegen when
such a function is run many times in a non-compiled way, because it drags down
the speed of the other generated code. Since
`doAggregateWithKeys`/`doAggregateWithoutKey` runs only once, the impact is
small, so a whole-stage codegen query is still faster than a non-whole-stage
codegen one.
This PR improves the aggregation because it extracts small functions from
`doAggregateWithKeys`/`doAggregateWithoutKey`. Those functions are run many
times inside the wrapping function, so the JIT now has room to step in.
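
To illustrate the shape I mean, here is a minimal hand-written sketch, not the
actual generated code. `AggregateSketch`, `updateBuffer`, and the sum
aggregation are hypothetical names and logic I made up for illustration; only
`doAggregateWithKeys` mirrors a name from the generated code, and its signature
here is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

class AggregateSketch {
    // Hypothetical small function extracted from the aggregation loop.
    // It is invoked once per input row, so it becomes a hot method.
    static long updateBuffer(long buffer, long value) {
        return buffer + value;
    }

    // Stand-in for the wrapping function: invoked once per query,
    // it loops over all rows and delegates the per-row work.
    static Map<Integer, Long> doAggregateWithKeys(int[] keys, long[] values) {
        Map<Integer, Long> buffers = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            long current = buffers.getOrDefault(keys[i], 0L);
            buffers.put(keys[i], updateBuffer(current, values[i]));
        }
        return buffers;
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 1, 2, 1};
        long[] values = {10, 20, 30, 40, 50};
        // Prints the per-key sums.
        System.out.println(doAggregateWithKeys(keys, values));
    }
}
```

Here the wrapping function is entered only once, while `updateBuffer` is
called for every row, which is the kind of small, frequently invoked function
that the extraction produces.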