[
https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358659#comment-17358659
]
Apache Spark commented on SPARK-35598:
--------------------------------------
User 'Antozamm' has created a pull request for this issue:
https://github.com/apache/spark/pull/32808
> Improve Spark-ML PCA analysis
> -----------------------------
>
> Key: SPARK-35598
> URL: https://issues.apache.org/jira/browse/SPARK-35598
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 3.1.2
> Reporter: Antonio Zammuto
> Priority: Minor
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely
> no jargon.*
> When performing a PCA, the covariance matrix is calculate first.
> In the function RowMAtrix.computeCovariance() there is the following code:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
> computeDenseVectorCovariance(mean, n, m)
> } else {
> computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> As can be seen, there are two different function to compute the covariance
> matrix depending on whether the matrix is sparse or dense. Now, the decision
> if the matrix is sparse or dense is based only on the type of the first row
> (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix, as ratio between
> number-of-zeros and total-number-of elements. And use the sparsity to switch
> between the two functions.
> *Q2. What problem is this proposal NOT designed to solve?*
> The 2 functions give slightly different results. This SPIP doesn't solve this
> issue
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing on the type of the first rows only is questionable.
> Better would be to base on the entire matrix, and on a mathematical concept,
> like sparsity.
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way, the decision is done on the
> entire matrix
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis, it will be beneficial for who is using
> the PCA analysis
> *Q6. What are the risks?*
> You still need a threshold to decide if the matrix is sparse or not. There is
> no universally defined value to define the sparsity.
> Therefore the threshold of 50% that I choose is arbitrary, and might not be
> the best in all cases.
> Still I consider as an improvement compared to the current implementation.
> *Q7. How long will it take?*
> The change is easy to implement and test
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there is no
> "unexplainable" differences with current implementation.
> "Unexplainable" meaning, beside the fact that there will be cases where it
> will now use the computeDenseVectorCovariance whereas before was using the
> computeSparseVectorCovariance function and vice versa.
> The tests in the RowMatrixSuite are successful.
> A couple of tests have been added to verify the sparsity is calculated
> properly.
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if
> any. Backward and forward compatibility must be taken into account.*
> defining a function to calculate the sparsity of the RowMatrix
> {code:java}
> def calcSparsity(): Double = {
> rows.map{ vec => (vec.size-vec.numNonzeros) }.sum/
> (rows.count() * rows.take(1)(0).size).toDouble
> }
> {code}
> Changing the switching condition. Before was:
> {code:java}
> if (rows.first().isInstanceOf[DenseVector]) {
> computeDenseVectorCovariance(mean, n, m)
> } else {
> computeSparseVectorCovariance(mean, n, m)
> }
> {code}
> Proposed new code is:
> {code:java}
> val sparsityThreshold = 0.5
> val sparsity = calcSparsity()
> if (sparsity<sparsityThreshold) {
> computeDenseVectorCovariance(mean, n, m)
> } else {
> computeSparseVectorCovariance(mean, n, m)
> }
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]