[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

neggert Tue, 20 Dec 2016 08:00:43 -0800

Github user neggert commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15018#discussion_r93266668
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala 
---
    @@ -328,74 +336,69 @@ class IsotonicRegression private (private var 
isotonic: Boolean) extends Seriali
           return Array.empty
         }
     
    -    // Pools sub array within given bounds assigning weighted average 
value to all elements.
    -    def pool(input: Array[(Double, Double, Double)], start: Int, end: 
Int): Unit = {
    -      val poolSubArray = input.slice(start, end + 1)
     
    -      val weightedSum = poolSubArray.map(lp => lp._1 * lp._3).sum
    -      val weight = poolSubArray.map(_._3).sum
    +    // Keeps track of the start and end indices of the blocks. 
blockBounds(start) gives the
    +    // index of the end of the block and blockBounds(end) gives the index 
of the start of the
    +    // block. Entries that are not the start or end of the block are 
meaningless. The idea is that
    --- End diff --
    
    It relies on knowing ahead of time wether `x` is the start or end 
index<sup>1</sup>. If it's a start index, `blockBounds(x)` gives the ending 
index of that block. If it's an end index, `blockBounds(x)` gives the starting 
index of the block. So yes, the implementations of `blockStart` and `blockEnd` 
are identical. I just have two different functions because it makes the code 
more readable.
    
    Maybe the comment from scikit-learn (where I borrowed this idea from) 
explains it better? (their `target` = my `blockBounds`)
    
    > target describes a list of blocks.  At any time, if [i..j] (inclusive) is
    > an active block, then blockBounds[i] := j and target[j] := i.
    
    The trick is just in maintaining the array so that the above property is 
always true. At initialization, it's trivially true because all blocks have 
only one element, and `blockBounds(x)` = `x`. After initialization, 
`blockBounds` is only modified by the merge function, which is set up to modify 
`blockBounds` so that this property is preserved, then return the starting 
index of the newly-merged block.
    
    This is admittedly a bit tricky, but it's a lot faster than the 
implementation I did where created a doubly-linked list of `Block` case 
classes. I'm open to suggestions on how to explain it better.
    
    <sup>1</sup> You could actually figure out whether you have a start or an 
end index by comparing `blockBounds(x)` to `x`. The lesser value will be the 
start index.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15018: [SPARK-17455][MLlib] Improve PAVA implementation ...

Reply via email to