LuciferYang commented on pull request #29366: URL: https://github.com/apache/spark/pull/29366#issuecomment-679078717
@srowen @HyukjinKwon @dongjoon-hyun @msamirkhan Hi, has anyone looked at the performance impact of this change? I found that it has a negative impact on performance and created a new Jira, [SPARK-32690](https://issues.apache.org/jira/browse/SPARK-32690).

A typical case is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001. The code that became significantly slower is line 41 in `HyperLogLogPlusPlusSuite`:

https://github.com/apache/spark/blob/08b951b1cb58cea2c34703e43202fe7c84725c8a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala#L40-L44

The size of `hll.aggBufferAttributes` in this case is 209716. The comparison before and after SPARK-32550 was merged is as follows:

| Case | After SPARK-32550 create createBuffer | After SPARK-32550 end to end | Before SPARK-32550 create input | Before SPARK-32550 end to end |
| -- | -- | -- | -- | -- |
| rsd 0.001, n 1000 | 52715513243 | 53004810687 | 195807999 | 773977677 |
| rsd 0.001, n 5000 | 51881246165 | 52519358215 | 13689949 | 249974855 |
| rsd 0.001, n 10000 | 52234282788 | 52374639172 | 14199071 | 183452846 |
| rsd 0.001, n 50000 | 55503517122 | 55664035449 | 15219394 | 584477125 |
| rsd 0.001, n 100000 | 51862662845 | 52116774177 | 19662834 | 166483678 |
| rsd 0.001, n 500000 | 51619226715 | 52183189526 | 178048012 | 16681330 |
| rsd 0.001, n 1000000 | 54861366981 | 54976399142 | 226178708 | 18826340 |
| rsd 0.001, n 5000000 | 52023602143 | 52354615149 | 388173579 | 15446409 |
| rsd 0.001, n 10000000 | 53008591660 | 53601392304 | 533454460 | 16033032 |

We can use `mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none` to verify the results above.
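
To make the size of the regression easier to read, here is a minimal sketch that computes the end-to-end slowdown factor for each case from the numbers in the table above. The object and field names (`SlowdownRatios`, `Row`, `slowdown`) are made up for illustration and are not part of Spark. The sketch assumes only that both end-to-end columns use the same unit, so the ratio is unit-free.

```scala
// Hypothetical helper, not Spark code: end-to-end slowdown per case.
object SlowdownRatios {
  // One table row: n, end-to-end time after and before SPARK-32550.
  final case class Row(n: Long, afterEndToEnd: Long, beforeEndToEnd: Long) {
    // Ratio of the two timings; assumes both were measured in the same unit.
    def slowdown: Double = afterEndToEnd.toDouble / beforeEndToEnd.toDouble
  }

  // Values copied from the "end to end" columns of the table (rsd = 0.001).
  val rows: Seq[Row] = Seq(
    Row(1000L, 53004810687L, 773977677L),
    Row(5000L, 52519358215L, 249974855L),
    Row(10000L, 52374639172L, 183452846L),
    Row(50000L, 55664035449L, 584477125L),
    Row(100000L, 52116774177L, 166483678L),
    Row(500000L, 52183189526L, 16681330L),
    Row(1000000L, 54976399142L, 18826340L),
    Row(5000000L, 52354615149L, 15446409L),
    Row(10000000L, 53601392304L, 16033032L)
  )

  def main(args: Array[String]): Unit =
    rows.foreach { r =>
      println(f"n=${r.n}%-9d slowdown=${r.slowdown}%.1fx")
    }
}
```

Every case comes out well above 1x, since the end-to-end time after the change stays above 5.2e10 regardless of n while the time before varies between roughly 1.5e7 and 7.7e8.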
