Github user neggert commented on a diff in the pull request:
https://github.com/apache/spark/pull/15018#discussion_r93266668
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala
---
@@ -328,74 +336,69 @@ class IsotonicRegression private (private var
isotonic: Boolean) extends Seriali
return Array.empty
}
- // Pools sub array within given bounds assigning weighted average
value to all elements.
- def pool(input: Array[(Double, Double, Double)], start: Int, end:
Int): Unit = {
- val poolSubArray = input.slice(start, end + 1)
- val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
- val weight = poolSubArray.map(_._3).sum
+ // Keeps track of the start and end indices of the blocks.
blockBounds(start) gives the
+ // index of the end of the block and blockBounds(end) gives the index
of the start of the
+ // block. Entries that are not the start or end of the block are
meaningless. The idea is that
--- End diff --
It relies on knowing ahead of time wether `x` is the start or end
index<sup>1</sup>. If it's a start index, `blockBounds(x)` gives the ending
index of that block. If it's an end index, `blockBounds(x)` gives the starting
index of the block. So yes, the implementations of `blockStart` and `blockEnd`
are identical. I just have two different functions because it makes the code
more readable.
Maybe the comment from scikit-learn (where I borrowed this idea from)
explains it better? (their `target` = my `blockBounds`)
> target describes a list of blocks. At any time, if [i..j] (inclusive) is
> an active block, then blockBounds[i] := j and target[j] := i.
The trick is just in maintaining the array so that the above property is
always true. At initialization, it's trivially true because all blocks have
only one element, and `blockBounds(x)` = `x`. After initialization,
`blockBounds` is only modified by the merge function, which is set up to modify
`blockBounds` so that this property is preserved, then return the starting
index of the newly-merged block.
This is admittedly a bit tricky, but it's a lot faster than the
implementation I did where created a doubly-linked list of `Block` case
classes. I'm open to suggestions on how to explain it better.
<sup>1</sup> You could actually figure out whether you have a start or an
end index by comparing `blockBounds(x)` to `x`. The lesser value will be the
start index.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]