Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r236658063 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java --- @@ -67,7 +66,7 @@ public void testPCA() { JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2); RowMatrix mat = new RowMatrix(dataRDD.map( - (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new DenseVector(vector.toArray()) + (Vector vector) -> org.apache.spark.mllib.linalg.Vectors.fromML(vector) --- End diff -- This is an interesting point; what if the data is a mix of sparse and dense? At least this test now covers that case a bit. The check above only tests the first element, but that's the only reasonable thing to do. If it were, for example, mostly sparse but the first happened to be dense, then this change could be problematic. I think it's OK to assume that most real-world uses are all sparse or dense. It is possible that some data in a sparse case is occasionally encoded as sparse as an optimization if it has a lot of zeroes, but this case works fine here.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org