Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23126#discussion_r236658063
  
    --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java 
---
    @@ -67,7 +66,7 @@ public void testPCA() {
         JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2);
     
         RowMatrix mat = new RowMatrix(dataRDD.map(
    -        (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new 
DenseVector(vector.toArray())
    +        (Vector vector) -> 
org.apache.spark.mllib.linalg.Vectors.fromML(vector)
    --- End diff --
    
    This is an interesting point; what if the data is a mix of sparse and 
dense? At least this test now covers that case a bit.
    
    The check above only tests the first element, but that's the only 
reasonable thing to do. If it were, for example, mostly sparse but the first 
happened to be dense, then this change could be problematic. I think it's OK to 
assume that most real-world uses are all sparse or dense. It is possible that 
some data in a sparse case is occasionally encoded as sparse as an optimization 
if it has a lot of zeroes, but this case works fine here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to