Github user juliuszsompolski commented on a diff in the pull request:
https://github.com/apache/spark/pull/21133#discussion_r184347998
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala
---
@@ -238,12 +238,6 @@ object ApproximatePercentile {
summaries = summaries.insert(value)
// The result of QuantileSummaries.insert is un-compressed
isCompressed = false
-
- // Currently, QuantileSummaries ignores the construction parameter compressThresHold,
- // which may cause QuantileSummaries to occupy unbounded memory. We have to hack around here
- // to make sure QuantileSummaries doesn't occupy infinite memory.
- // TODO: Figure out why QuantileSummaries ignores construction parameter compressThresHold
- if (summaries.sampled.length >= compressThresHoldBufferLength) compress()
--- End diff --
I checked whether this change would stop `compress()` from being called at all
and let memory consumption grow unbounded, but it appears to work fine: the
memory usage reported by `jmap -histo:live` while running `sql("select
approx_percentile(id, array(0.1)) from range(10000000000L)").collect()` remains
stable.
`compress()` is called from `QuantileSummaries.insert()`, so it seems that the
above TODO was resolved at some point.
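
To illustrate why the caller-side check becomes redundant once `insert()`
compresses on its own, here is a minimal toy sketch. It is not Spark's
`QuantileSummaries`: `ToySummary`, `ToySummaryDemo`, `compressCalls`, and the
"keep every other sample" compression are all hypothetical stand-ins, assumed
only for illustration.

```scala
// Toy model only: NOT Spark's QuantileSummaries. All names are hypothetical.
final case class ToySummary(
    compressThreshold: Int,
    sampled: Vector[Double] = Vector.empty,
    compressCalls: Int = 0) {

  // insert returns a new summary and compresses it as soon as the sample
  // buffer reaches the threshold, so callers need no extra length check.
  def insert(x: Double): ToySummary = {
    val grown = copy(sampled = sampled :+ x)
    if (grown.sampled.length >= compressThreshold) grown.compress() else grown
  }

  // Stand-in for real compression: sort, then keep every other sample.
  def compress(): ToySummary = {
    val sorted = sampled.sorted
    val kept = sorted.zipWithIndex.collect { case (v, i) if i % 2 == 0 => v }
    copy(sampled = kept, compressCalls = compressCalls + 1)
  }
}

object ToySummaryDemo {
  // Feeds n values and returns (largest buffer length seen, final summary).
  def run(n: Int, threshold: Int): (Int, ToySummary) = {
    var s = ToySummary(threshold)
    var maxLen = 0
    var i = 0
    while (i < n) {
      s = s.insert(i.toDouble)
      maxLen = math.max(maxLen, s.sampled.length)
      i += 1
    }
    (maxLen, s)
  }

  def main(args: Array[String]): Unit = {
    val (maxLen, s) = run(100000, 1000)
    println(s"max buffer length = $maxLen, compress calls = ${s.compressCalls}")
  }
}
```

In this toy model the buffer length observed after each `insert()` stays below
the threshold without any check in the caller, which is the behaviour the
`jmap` observation above suggests for the real class.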
---