rf972 commented on pull request #29695: URL: https://github.com/apache/spark/pull/29695#issuecomment-741810936
Thanks @huaxingao for the fix! We were able to run the TPCH benchmark with your latest code and have these results for almost all of the TPCH queries. Although many of these queries do use aggregates, most cannot yet push them down because other portions of the query, such as joins and UDFs, cannot be pushed down. Our simulation shows that Q18 and Q20 will also benefit from aggregate pushdown once Issue #1 is fixed. These test results were generated using our V2 data source (https://github.com/rf972/s3datasource) and our Docker-based Spark/S3 test environment (https://github.com/peterpuhov-github/dike).

```
TPCH Aggregate Test

Query  No Pushdown    Filter         Filter, Project  Aggregate    Difference
-----  -------------  -------------  ---------------  -----------  ----------
1      754,998,934    744,353,513    152,005,702      152,005,702  0
2      288,283,676    240,552,680    19,640,050       19,640,050   0
3      950,316,971    495,621,641    90,182,777       90,182,777   0
4      925,953,149    761,524,915    179,910,653      179,910,653  0
5      951,729,284    806,714,442    164,092,188      164,092,188  0
6      754,998,934    14,320,106     1,565,944        17           1,565,927
7      951,731,111    426,753,589    91,221,930       91,221,930   0
8      975,686,104    833,050,619    211,038,608      211,038,608  0
9      1,070,094,601  1,047,408,449  258,031,293      258,031,293  0
10     950,319,184    216,925,396    55,738,379       55,738,379   0
11     360,560,535    360,554,115    50,546,842       50,546,842   0
12     925,953,149    203,531,165    37,713,185       37,713,185   0
13     195,318,037    195,318,037    96,823,834       96,823,834   0
14     778,953,541    33,509,818     7,152,453        7,152,453    0
15     1,511,407,582  58,271,128     9,143,413        Issue #1
16     144,139,239    143,184,520    17,410,512       17,410,512   0
17     1,557,907,082  1,510,046,884  164,440,699      164,440,699  0
18     1,705,315,905  1,705,315,905  183,515,611      Issue #1
19     778,953,541    40,553,750     8,311,313        8,311,313    0
20     899,140,386    258,559,476    34,420,567       Issue #1
21     2,437,362,944  2,349,560,183  498,738,573      498,738,573  0
22     219,681,859    217,455,483    17,168,762       Issue #1
```

Issue #1 above is an exception we noticed in Q15, Q18, Q20, and Q22.
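For clarity on how the last column above is read: in every completed row, Difference is simply the Filter, Project value minus the Aggregate value (zero whenever aggregate pushdown saves nothing beyond filter/project pushdown). A minimal sketch, using the Q6 row as a worked example:

```scala
// Sketch: the Difference column as Filter,Project minus Aggregate.
// The Q6 figures below are taken from the table; the helper name is ours.
object DifferenceExample {
  def difference(filterProject: Long, aggregate: Long): Long =
    filterProject - aggregate

  def main(args: Array[String]): Unit = {
    // Q6: 1,565,944 with filter/project pushdown vs 17 with aggregate pushdown
    println(difference(1565944L, 17L)) // prints 1565927, matching the table
  }
}
```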
We came up with these tests to demonstrate the issue:

```scala
val df = sparkSession.table("h2.test.employee")

val query1 = df.select($"DEPT", $"SALARY".as("value"))
  .groupBy($"DEPT")
  .agg(sum($"value").as("total"))
  .filter($"total" > 1000)
query1.show()

val decrease = udf { (x: Double, y: Double) => x - y }

val query2 = df
  .select($"DEPT", decrease($"SALARY", $"BONUS").as("value"), $"SALARY", $"BONUS")
  .groupBy($"DEPT")
  .agg(sum($"value"), sum($"SALARY"), sum($"BONUS"))
query2.show()
```

Please let us know if additional details are needed on this. Thanks!

We are actively looking at using larger data sets to generate meaningful test timings, and will share the timing results as soon as we are done.
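As an aside on how we confirm whether a query actually benefited: we capture the query's `explain()` output and look for a pushed-aggregate marker in the scan node. The sketch below is plain Scala with no Spark dependency; the `PushedAggregates` marker string is an assumption about the DataSource V2 scan output, so check it against the plan text your build actually prints.

```scala
// Sketch: detect aggregate pushdown by scanning captured explain() text.
// Assumption: the V2 scan node reports pushed aggregates with a
// "PushedAggregates" marker (hypothetical; verify against your build).
object PushdownCheck {
  // True if the captured plan text shows a pushed-down aggregate.
  def aggregatePushedDown(planText: String): Boolean =
    planText.contains("PushedAggregates")

  def main(args: Array[String]): Unit = {
    val pushed    = "Scan ... PushedAggregates: [SUM(SALARY)], PushedFilters: []"
    val notPushed = "HashAggregate(keys=[DEPT], functions=[sum(SALARY)])"
    println(aggregatePushedDown(pushed))    // prints true
    println(aggregatePushedDown(notPushed)) // prints false
  }
}
```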
