Github user dbtsai commented on the issue:
https://github.com/apache/spark/pull/15593
@MLnick I'm hoping that we can abstract away the implementation of the column-major format as much as possible so that, in the future, new developers can understand the code without having to think hard. I agree that we have good comments and documentation in the code now; having said that, with some effort, we can hide the details by using column-major matrices.
For example, if we have the coefficients and gradient in column-major matrices, we can write the L2 part of the gradient as the following code, without computing `isIntercept` and `featureIndex` through complicated index arithmetic.
```scala
// numFeaturesPlusIntercept is numFeatures + 1 when fitIntercept is true
// (the intercepts occupy the last column), and numFeatures otherwise.
val coeffMatrix = new DenseMatrix(numClasses, numFeaturesPlusIntercept,
  new Array[Double](numClasses * numFeaturesPlusIntercept), isTransposed = false)
val totalGradientMatrix = new DenseMatrix(numClasses, numFeaturesPlusIntercept,
  new Array[Double](numClasses * numFeaturesPlusIntercept), isTransposed = false)

// L2 regularization term of the loss; the L2 part of the gradient is
// accumulated into totalGradientMatrix as a side effect.
val regVal = {
  var sum = 0.0
  coeffMatrix.foreachActive { case (classIndex, featureIndex, value) =>
    // We do not apply regularization to the intercepts.
    val isIntercept = fitIntercept && featureIndex == numFeatures
    if (!isIntercept) {
      // The following code computes the loss of the regularization as well
      // as the gradient of the regularization, and adds the gradient back
      // to totalGradientMatrix.
      sum += {
        if (standardization) {
          totalGradientMatrix.update(classIndex, featureIndex,
            totalGradientMatrix(classIndex, featureIndex) + regParamL2 * value)
          value * value
        } else if (featuresStd(featureIndex) != 0.0) {
          // If `standardization` is false, we still standardize the data
          // to improve the rate of convergence; as a result, we have to
          // perform this reverse standardization by penalizing each
          // component differently to get effectively the same objective
          // function when the training dataset is not standardized.
          val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex))
          totalGradientMatrix.update(classIndex, featureIndex,
            totalGradientMatrix(classIndex, featureIndex) + regParamL2 * temp)
          value * temp
        } else {
          0.0
        }
      }
    }
  }
  0.5 * regParamL2 * sum
}
```
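As a side note, here's a minimal, self-contained sketch (using Breeze's column-major `DenseMatrix`; the names and values are purely illustrative, not this PR's code) of how iterating `(classIndex, featureIndex, value)` triples hides the flat-index arithmetic:
```scala
import breeze.linalg.{DenseMatrix => BDM}

object ColumnMajorDemo extends App {
  // 2 classes x 3 features; the backing array is laid out column by column
  // (column-major), so column j holds the coefficients for feature j.
  val coeffs = new BDM[Double](2, 3, Array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

  // activeIterator yields ((row, col), value) pairs in column-major order,
  // so callers never compute `row + col * numRows` themselves.
  coeffs.activeIterator.foreach { case ((classIndex, featureIndex), value) =>
    println(s"class=$classIndex feature=$featureIndex value=$value")
  }
}
```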
For the L1 part, I'm thinking of doing the following:
```scala
new BreezeOWLQN[(Int, Int), BDM[Double]]($(maxIter), 10, regParamL1Fun, $(tol))
```
then we can implement the L1 function as
```scala
// OWLQN's key type is (Int, Int) here, so the penalty function takes the
// (classIndex, featureIndex) key as a single tuple argument.
def regParamL1Fun = (index: (Int, Int)) => {
  val (_, featureIndex) = index
  // Remove the L1 penalization on the intercept.
  val isIntercept = fitIntercept && featureIndex == numFeatures
  if (isIntercept) {
    0.0
  } else if (standardization) {
    regParamL1
  } else if (featuresStd(featureIndex) != 0.0) {
    regParamL1 / featuresStd(featureIndex)
  } else {
    0.0
  }
}
```
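For contrast, without the matrix abstraction the same penalty needs hand-rolled flat-index bookkeeping, roughly like the sketch below (the names are illustrative, not this PR's code):
```scala
// Flat-index version: featureIndex has to be decoded from a single
// column-major offset by hand -- exactly the kind of index arithmetic the
// keyed (Int, Int) formulation above avoids.
def flatRegParamL1Fun(
    numClasses: Int,
    numFeatures: Int,
    fitIntercept: Boolean,
    standardization: Boolean,
    featuresStd: Array[Double],
    regParamL1: Double): Int => Double = { index =>
  val featureIndex = index / numClasses // column in the column-major layout
  val isIntercept = fitIntercept && featureIndex == numFeatures
  if (isIntercept) {
    0.0
  } else if (standardization) {
    regParamL1
  } else if (featuresStd(featureIndex) != 0.0) {
    regParamL1 / featuresStd(featureIndex)
  } else {
    0.0
  }
}
```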
By having the coefficients and gradient sum as matrices, we also open up the opportunity to store both of them in a sparse representation. In high-dimensional problems, both of them can often be very sparse.
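A minimal sketch of what that could look like with Breeze's CSC sparse matrix (illustrative only; this PR itself keeps dense matrices):
```scala
import breeze.linalg.CSCMatrix

object SparseCoeffsDemo extends App {
  val numClasses = 2
  val numFeatures = 100000

  // Only non-zero entries are stored, which matters when numFeatures is
  // large and most coefficients are zero.
  val builder = new CSCMatrix.Builder[Double](rows = numClasses, cols = numFeatures)
  builder.add(0, 12345, 0.7) // (classIndex, featureIndex, value)
  builder.add(1, 98765, -0.3)
  val sparseCoeffs = builder.result()

  // activeIterator visits only the stored non-zeros, so a regularization
  // loop like the one above would skip zero coefficients entirely.
  sparseCoeffs.activeIterator.foreach { case ((classIndex, featureIndex), value) =>
    println(s"class=$classIndex feature=$featureIndex value=$value")
  }
}
```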
I'm also aware that we're cutting the 2.1 release now, and this is an important performance improvement, so I'm open to dealing with this cleanup work in a separate PR so that at least we're able to merge this one.
Thanks.