alkis commented on code in PR #40121:
URL: https://github.com/apache/spark/pull/40121#discussion_r1115447370


##########
core/src/main/scala/org/apache/spark/util/collection/PercentileHeap.scala:
##########
@@ -20,97 +20,55 @@ package org.apache.spark.util.collection
 import scala.collection.mutable.PriorityQueue
 
 /**
- * PercentileHeap is designed to be used to quickly track the percentile of a 
group of numbers
- * that may contain duplicates. Inserting a new number has O(log n) time 
complexity and
- * determining the percentile has O(1) time complexity.
- * The basic idea is to maintain two heaps: a smallerHalf and a largerHalf. 
The smallerHalf
- * stores the smaller half of all numbers while the largerHalf stores the 
larger half.
- * The sizes of two heaps need to match the percentage each time when a new 
number is inserted so
- * that the ratio of their sizes is percentage to (1 - percentage). Therefore 
each time when
- * percentile() is called we check if the sizes of two heaps match the 
percentage. If they do,
- * we should return the average of the two top values of heaps. Otherwise we 
return the top of the
- * heap which exceeds its percentage.
+ * PercentileHeap tracks the percentile of a collection of numbers.
+ *
+ * Insertion is O(log n), Lookup is O(1).
+ *
+ * The implementation keeps two heaps: a bottom heap (`botHeap`) and a top 
heap (`topHeap`). The
+ * bottom heap stores all the numbers below the percentile and the top heap 
stores the ones above
+ * the percentile. During insertion the relative sizes of the heaps are 
adjusted to match the
+ * target percentile.
  */
-private[spark] class PercentileHeap(percentage: Double = 0.5)(implicit val 
ord: Ordering[Double]) {
-  assert(percentage >= 0 && percentage <= 1)
+private[spark] class PercentileHeap(percentage: Double = 0.5) {
+  assert(percentage > 0 && percentage < 1)
 
-  /**
-   * Stores all the numbers less than the current percentile in a smallerHalf,
-   * i.e percentile is the maximum, at the root.
-   */
-  private[this] val smallerHalf = PriorityQueue.empty[Double](ord)
+  private[this] val topHeap = 
PriorityQueue.empty[Double](Ordering[Double].reverse)
+  private[this] val botHeap = PriorityQueue.empty[Double](Ordering[Double])

Review Comment:
   I made the change to `smallHeap` and `largeHeap` even though I disagree. I 
will leave this here for the record.
   
   - the old names do not match the implementation. They made sense when this 
was a median heap because they talk about halves (`smallerHalf` and 
`largerHalf`) but since this was made into a percentile heap we no longer have 
halves.
   - small/large is confusing. When one reads `smallHeap` or `smallerHeap` she 
does not know if the heap is small (as in has few elements) or if the heap 
contains small numbers. On the other hand `botHeap` or `bottomHeap` is 
unambiguous. It is the heap with the small numbers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to