[ 
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358660#comment-17358660
 ] 

Apache Spark commented on SPARK-35598:
--------------------------------------

User 'Antozamm' has created a pull request for this issue:
https://github.com/apache/spark/pull/32808

> Improve Spark-ML PCA analysis
> -----------------------------
>
>                 Key: SPARK-35598
>                 URL: https://issues.apache.org/jira/browse/SPARK-35598
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.1.2
>            Reporter: Antonio Zammuto
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the function RowMatrix.computeCovariance() there is the following code:
> {code:java}
>     if (rows.first().isInstanceOf[DenseVector]) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}
> As can be seen, there are two different functions to compute the covariance 
> matrix, depending on whether the matrix is sparse or dense. Currently, the 
> decision whether the matrix is sparse or dense is based only on the type of 
> the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the 
> number of zeros and the total number of elements, and to use this sparsity 
> to switch between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The two functions give slightly different results; this SPIP does not solve 
> that issue.
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix and on a mathematical 
> concept, such as sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is based on 
> the entire matrix.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis, which will benefit anyone using PCA.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse, and there 
> is no universally accepted value defining sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary and might not be 
> the best in all cases.
>  I still consider it an improvement over the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no 
> "unexplainable" differences from the current implementation.
>  "Unexplainable" meaning: aside from the fact that there will be cases where 
> the code now uses computeDenseVectorCovariance where it previously used 
> computeSparseVectorCovariance, and vice versa.
> The tests in RowMatrixSuite pass.
>  A couple of tests have been added to verify that the sparsity is calculated 
> correctly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if 
> any. Backward and forward compatibility must be taken into account.*
> Defining a function to calculate the sparsity of the RowMatrix:
> {code:java}
>   /** Sparsity of the matrix: ratio of zero entries to total entries. */
>   def calcSparsity(): Double = {
>     rows.map { vec => vec.size - vec.numNonzeros }.sum /
>       (rows.count() * rows.first().size).toDouble
>   }
> {code}
> Changing the switching condition. The previous code was:
> {code:java}
>     if (rows.first().isInstanceOf[DenseVector]) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}
> The proposed new code is:
> {code:java}
>     val sparsityThreshold = 0.5
>     val sparsity = calcSparsity()
>     if (sparsity < sparsityThreshold) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}
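The proposal above can be illustrated with a small self-contained sketch. This uses plain Scala collections rather than a Spark RDD, and the 4x3 matrix and the variable names (`zeroEntries`, `useSparsePath`) are made up for illustration:

```scala
// Sparsity = number of zero entries / total number of entries.
// Hypothetical 4x3 matrix with 8 zeros out of 12 entries.
val rows = Seq(
  Array(1.0, 0.0, 0.0),
  Array(0.0, 2.0, 0.0),
  Array(0.0, 0.0, 0.0),
  Array(3.0, 0.0, 4.0)
)

val zeroEntries  = rows.map(_.count(_ == 0.0)).sum       // 8
val totalEntries = rows.length * rows.head.length        // 12
val sparsity     = zeroEntries.toDouble / totalEntries   // 8.0 / 12 ≈ 0.667

// With the proposed 50% threshold, this matrix would be routed to the
// sparse covariance computation, even if its first row were a DenseVector.
val sparsityThreshold = 0.5
val useSparsePath = sparsity >= sparsityThreshold
```

Note that the first row of this matrix is mostly nonzero, so a dispatch based on the first row's vector type could easily take the dense path here; the sparsity-based switch looks at the whole matrix instead.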



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
