rf972 commented on pull request #29695:
URL: https://github.com/apache/spark/pull/29695#issuecomment-761105047


   Once again, thanks @huaxingao for the fixes!  With the latest patch we can
now run all the TPC-H tests with no errors.
   
   We also looked into gathering timing information from the TPC-H tests to
illustrate the benefits of aggregate pushdown.  A few TPC-H tests show gains
from aggregate pushdown.  Not all of them benefit, since some have queries
with UDFs or joins, which prevent the aggregate from being pushed down.  The
tests that do benefit were not designed to illustrate the effects of aggregate
pushdown, so their improvement in execution time is small.  For example, one
test that benefits from aggregate pushdown uses a filter that reduces the data
to such an extent that the aggregate operation has little data left to operate
on.
   
   Given the above, we decided to modify one of the TPC-H queries to create a
test specifically designed to demonstrate the effects of aggregate pushdown.
Our goal was to create a test that models another reasonably typical use case:
one where filtering removes little of the data.  In such cases we expect
aggregate pushdown to bring large benefits, chiefly a reduction in the data
that Spark needs to receive and process.
   
   In our case we modified test Q6 by simply removing the filter.  This gives
us a test where we can compare the query with project pushdown only against
the same query with both project and aggregate pushdown.
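
   For readers who do not have Q6 in front of them, here is a rough sketch of
what "removing the filter" amounts to.  It assumes the standard TPC-H Q6 SQL
text with the spec's default substitution parameters.  The comment above does
not show the exact query we ran, and `strip_where` is a hypothetical helper
used only for illustration.

```python
import re

# Standard TPC-H Q6 text (default substitution parameters from the spec).
# Assumption: the test above is based on this form of the query.
TPCH_Q6 = """
SELECT sum(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '1994-01-01'
  AND l_shipdate < date '1994-01-01' + interval '1' year
  AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
  AND l_quantity < 24
"""


def strip_where(sql: str) -> str:
    """Drop everything from the first WHERE keyword onward.

    Hypothetical helper: only valid for a simple single-table query with
    no GROUP BY / ORDER BY after the WHERE clause, as is the case for Q6.
    """
    head = re.split(r"\bWHERE\b", sql, maxsplit=1, flags=re.IGNORECASE)[0]
    return head.strip()


# What remains is a bare aggregate over two projected columns, so the
# result is a single value regardless of table size.
Q6_NO_FILTER = strip_where(TPCH_Q6)
print(Q6_NO_FILTER)
```

   With the filter gone, every row of `lineitem` feeds the `sum`, which is
the case where pushing the aggregate down should matter most.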
   
   ```
                  Test    Seconds  Bytes transferred
   -------------------  ---------  -----------------
          Project only    898.199      4,161,369,576
   Project w/Aggregate    187.104                 17
   ```
   
   As we can see from the above, aggregate pushdown reduces the query time by
about 79% (898.199 s down to 187.104 s) and the data transferred by more than
99.99% (about 4.16 GB down to 17 bytes).
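
   The percentages can be re-derived from the table with a few lines of
Python.  This is only an arithmetic check on the numbers reported above; the
variable and function names are made up for the example.

```python
# Figures copied from the table above.
project_only_seconds = 898.199
with_aggregate_seconds = 187.104
project_only_bytes = 4_161_369_576
with_aggregate_bytes = 17


def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


time_reduction = pct_reduction(project_only_seconds, with_aggregate_seconds)
bytes_reduction = pct_reduction(project_only_bytes, with_aggregate_bytes)
speedup = project_only_seconds / with_aggregate_seconds

print(f"time reduction:  {time_reduction:.1f}%")
print(f"bytes reduction: {bytes_reduction:.7f}%")
print(f"speedup:         {speedup:.2f}x")
```

   Put another way, the run with aggregate pushdown is a bit under five times
faster, and the bytes transferred drop to a negligible fraction of the
original.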
   

