[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

srowen Tue, 27 Nov 2018 05:20:17 -0800

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23126#discussion_r236658063
  
    --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java 
---
    @@ -67,7 +66,7 @@ public void testPCA() {
         JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2);
     
         RowMatrix mat = new RowMatrix(dataRDD.map(
    -        (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new 
DenseVector(vector.toArray())
    +        (Vector vector) -> 
org.apache.spark.mllib.linalg.Vectors.fromML(vector)
    --- End diff --
    
    This is an interesting point; what if the data is a mix of sparse and 
dense? At least this test now covers that case a bit.
    
    The check above only tests the first element, but that's the only 
reasonable thing to do. If it were, for example, mostly sparse but the first 
happened to be dense, then this change could be problematic. I think it's OK to 
assume that most real-world uses are all sparse or dense. It is possible that 
some data in a sparse case is occasionally encoded as sparse as an optimization 
if it has a lot of zeroes, but this case works fine here.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

Reply via email to