Github user juliuszsompolski commented on a diff in the pull request: https://github.com/apache/spark/pull/21133#discussion_r184347998 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala --- @@ -238,12 +238,6 @@ object ApproximatePercentile { summaries = summaries.insert(value) // The result of QuantileSummaries.insert is un-compressed isCompressed = false - - // Currently, QuantileSummaries ignores the construction parameter compressThresHold, - // which may cause QuantileSummaries to occupy unbounded memory. We have to hack around here - // to make sure QuantileSummaries doesn't occupy infinite memory. - // TODO: Figure out why QuantileSummaries ignores construction parameter compressThresHold - if (summaries.sampled.length >= compressThresHoldBufferLength) compress() --- End diff -- I tested if this change doesn't cause `compress()` to not be called at all, and memory consumption to go ubounded, but it appears to be working good - the mem usage through jmap -histo:live when running `sql("select approx_percentile(id, array(0.1)) from range(10000000000L)").collect()` remains stable. The compress() is being called from `QuantileSummaries.insert()`, so it seems that the above TODO got resolved at some point.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org