WangGuangxin opened a new pull request #28780:
URL: https://github.com/apache/spark/pull/28780


   ### What changes were proposed in this pull request?
   This issue occurs when a hash aggregate falls back to a sort-based aggregate. `UnsafeExternalSorter.createWithExistingInMemorySorter` immediately calls `spill` on the `InMemorySorter`, but the memory that the in-memory sorter points to was acquired by the outside `BytesToBytesMap`, not by the `allocatedPages` of the `UnsafeExternalSorter`. As a result, the memory spill bytes metric is always 0, while the disk spill bytes metric is correct.
   
   Related code is at 
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L232.
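   
   For illustration only, here is a minimal sketch of the accounting gap (the class and fields below are hypothetical, not Spark's actual internals): the spill path only counts pages the sorter allocated itself, so records that live in memory owned by the map never show up in the metric. A fix along these lines would have the caller report the size of that map-owned memory explicitly when the immediate spill happens.
   
   ```scala
   // Hypothetical model of the accounting gap; names are illustrative and
   // this is not Spark's actual UnsafeExternalSorter implementation.
   class ExternalSorterSketch {
     private var allocatedPagesBytes = 0L  // memory the sorter allocated itself
     var memoryBytesSpilled = 0L           // the metric that stays at 0
   
     // The immediate spill after createWithExistingInMemorySorter only frees
     // (and therefore only counts) the sorter's own pages, which are empty here;
     // the spilled records actually live in pages owned by the BytesToBytesMap.
     def spill(): Unit = {
       memoryBytesSpilled += allocatedPagesBytes
       allocatedPagesBytes = 0L
     }
   }
   ```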
   
   It can be reproduced with the following steps.
   ```
   bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
   scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
   ```
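   
   To check the counters without opening the UI, one option (a sketch, assuming the listener is registered in the same spark-shell session before running the query) is to print the task-level spill metrics, which should reflect the same accounting as the SQL metrics in the screenshots below:
   
   ```scala
   // Sketch: print per-task spill counters; paste into spark-shell before
   // running the reproduction query above.
   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
   
   spark.sparkContext.addSparkListener(new SparkListener {
     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
       val m = taskEnd.taskMetrics
       if (m != null && (m.memoryBytesSpilled > 0 || m.diskBytesSpilled > 0)) {
         // Before this patch, memoryBytesSpilled is expected to stay 0 even
         // though diskBytesSpilled is non-zero for the spilling aggregate tasks.
         println(s"task ${taskEnd.taskInfo.taskId}: " +
           s"memoryBytesSpilled=${m.memoryBytesSpilled} " +
           s"diskBytesSpilled=${m.diskBytesSpilled}")
       }
     }
   })
   ```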
   
   Before this patch, the metric is
   
![image](https://user-images.githubusercontent.com/1312321/84248141-99cd2500-ab3b-11ea-8821-f78cf483557d.png)
   
   After this patch, the metric is
   
![image](https://user-images.githubusercontent.com/1312321/84248156-a18cc980-ab3b-11ea-86d3-fff85d239ed0.png)
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Tested manually.
   

