LuciferYang opened a new pull request #29529:
URL: https://github.com/apache/spark/pull/29529


   ### What changes were proposed in this pull request?
   This PR reverts SPARK-32550 for performance reasons.
   
   
   ### Why are the changes needed?
   SPARK-32550 has a negative impact on performance. A typical case is the "deterministic cardinality estimation" test in `HyperLogLogPlusPlusSuite` when rsd is 0.001. The code that became significantly slower is line 41 of `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType))`
   
   
https://github.com/apache/spark/blob/08b951b1cb58cea2c34703e43202fe7c84725c8a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala#L40-L44
   
   The size of `hll.aggBufferAttributes` in this case is 209716. The timings before and after SPARK-32550 was merged are as follows (all values in ns):
   
   
   Case | After SPARK-32550 createBuffer (ns) | After SPARK-32550 end to end (ns) | Before SPARK-32550 createBuffer (ns) | Before SPARK-32550 end to end (ns)
   -- | -- | -- | -- | --
   rsd 0.001, n   1000 | 52715513243 | 53004810687 | 195807999 | 773977677
   rsd 0.001, n   5000 | 51881246165 | 52519358215 | 13689949 | 249974855
   rsd 0.001, n   10000 | 52234282788 | 52374639172 | 14199071 | 183452846
   rsd 0.001, n   50000 | 55503517122 | 55664035449 | 15219394 | 584477125
   rsd 0.001, n   100000 | 51862662845 | 52116774177 | 19662834 | 166483678
   rsd 0.001, n   500000 | 51619226715 | 52183189526 | 178048012 | 16681330
   rsd 0.001, n   1000000 | 54861366981 | 54976399142 | 226178708 | 18826340
   rsd 0.001, n   5000000 | 52023602143 | 52354615149 | 388173579 | 15446409
   rsd 0.001, n   10000000 | 53008591660 | 53601392304 | 533454460 | 16033032
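
The buffer size of 209716 quoted above can be reproduced with a little arithmetic. The sketch below is standalone and does not call Spark. The object name `HllBufferSize` and its constants are illustrative assumptions that mirror how I understand `HyperLogLogPlusPlusHelper` to derive its word count (6-bit registers packed into 64-bit words), so treat it as a rough check rather than the actual implementation.

```scala
// Hypothetical helper, not part of Spark: estimates how many Long words
// an HLL++ aggregation buffer needs for a given relative standard deviation.
object HllBufferSize {
  // Assumed constants: 6-bit registers packed into 64-bit words.
  val RegisterSize: Int = 6
  val WordSize: Int = java.lang.Long.SIZE
  val RegistersPerWord: Int = WordSize / RegisterSize

  // Precision p, assuming p = ceil(2 * log2(1.106 / rsd)).
  def precision(relativeSD: Double): Int =
    math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt

  // Number of Long words (one buffer attribute each) for m = 2^p registers.
  def numWords(relativeSD: Double): Int = {
    val m = 1 << precision(relativeSD)
    m / RegistersPerWord + 1
  }
}
```

Under these assumptions, `HllBufferSize.numWords(0.001)` yields the same 209716 reported above, which means the `SpecificInternalRow` constructor has to process a data-type sequence of that length every time the buffer is created.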
   
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   ### How was this patch tested?
   `mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`
   
   **Before this revert**: 8 m 18 s 320 ms

   **After this revert**: 6 s 278 ms
   




