rf972 commented on pull request #29695:
URL: https://github.com/apache/spark/pull/29695#issuecomment-761105047
Once again, thanks @huaxingao for the fixes! With the latest patch we can
now run all the TPC-H tests with no errors.

We also looked into gathering timing information from the TPC-H tests to
illustrate the benefits of aggregate pushdown. A few TPC-H tests show gains
from it, but not all of them benefit, since some have queries with UDFs or
joins, which prevent the aggregate pushdown. The tests that do benefit were
not designed to exercise aggregate pushdown, so their gains in execution
time are small. For example, one such test uses a filter that reduces the
data to such an extent that the aggregate operation has little data left to
operate on.

Given the above, we decided to modify one of the TPC-H queries to create a
test specifically designed to demonstrate the effects of aggregate pushdown.
Our goal was a test that models another reasonably typical use case: one
with limited data reduction by filtering. In such cases we expect aggregate
pushdown to bring large benefits, such as a reduction in the data that Spark
needs to receive and process.

In our case we modified test Q6 by simply removing the filter. This gives
us a test where we can compare the query with project pushdown only against
the query with both project and aggregate pushdown.
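
To make the comparison concrete, here is a minimal sketch of the two modes.
It is not the Spark code from this PR: it uses Python's built-in sqlite3 as a
stand-in for the remote data source, the table holds synthetic rows, and the
helper names (build_lineitem, revenue_project_only, revenue_with_aggregate)
are hypothetical. The query shape assumes the standard TPC-H Q6 aggregate,
SUM(l_extendedprice * l_discount), with the filter removed.

```python
import sqlite3


def build_lineitem(rows):
    """Create an in-memory table standing in for the remote data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lineitem (l_extendedprice REAL, l_discount REAL)")
    conn.executemany("INSERT INTO lineitem VALUES (?, ?)", rows)
    return conn


def revenue_project_only(conn):
    """Project pushdown only: the source returns every row of the two
    needed columns and the caller computes the aggregate itself."""
    fetched = conn.execute(
        "SELECT l_extendedprice, l_discount FROM lineitem"
    ).fetchall()
    return sum(price * discount for price, discount in fetched), len(fetched)


def revenue_with_aggregate(conn):
    """Project and aggregate pushdown: the source computes the sum and
    returns a single row."""
    fetched = conn.execute(
        "SELECT SUM(l_extendedprice * l_discount) FROM lineitem"
    ).fetchall()
    return fetched[0][0], len(fetched)


# Synthetic data for illustration only; these are not TPC-H rows.
rows = [(100.0 * (i % 7 + 1), 0.01 * (i % 10)) for i in range(10_000)]
conn = build_lineitem(rows)

total_a, rows_a = revenue_project_only(conn)
total_b, rows_b = revenue_with_aggregate(conn)
print(rows_a, rows_b)  # prints "10000 1": rows received in each mode
```

Both functions compute the same revenue; the only difference is how many rows
cross the boundary between the source and the caller. The timings and byte
counts reported in this comment come from the actual Spark runs, not from
this toy.
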
```
Test                 Seconds   Bytes
-------------------  --------  -------------
Project only         898.199   4,161,369,576
Project w/Aggregate  187.104              17
```
As we can see from the above, aggregate pushdown reduced the query time by
about 79% and the data transferred by more than 99.99%.
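
For anyone who wants to check the percentages, the arithmetic can be sketched
as below. The input numbers are the ones reported in the table; the variable
and function names are only illustrative.

```python
# Measurements copied from the results table in this comment.
project_only_seconds = 898.199
with_aggregate_seconds = 187.104
project_only_bytes = 4_161_369_576
with_aggregate_bytes = 17


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


time_reduction = percent_reduction(project_only_seconds, with_aggregate_seconds)
bytes_reduction = percent_reduction(project_only_bytes, with_aggregate_bytes)
speedup = project_only_seconds / with_aggregate_seconds

print(f"time reduction:  {time_reduction:.1f}%")
print(f"bytes reduction: {bytes_reduction:.7f}%")
print(f"speedup:         {speedup:.1f}x")
```

The time reduction comes out at roughly 79%, which corresponds to a speedup
of a little under 5x, and the byte reduction is well above 99.99%.
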