[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130045#comment-16130045 ]
zhengruifeng commented on SPARK-21742: -------------------------------------- [~srowen] I cache the dataset in that test, then it fails. So something maybe wrong in the tast case. (I will test it again with a deterministicly generated dataset.) And if the randomly generated datasets in testsuites are not deterministic (I will check this point), I think we should persisit all test datasets. > BisectingKMeans generate different models with/without caching > -------------------------------------------------------------- > > Key: SPARK-21742 > URL: https://issues.apache.org/jira/browse/SPARK-21742 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.3.0 > Reporter: zhengruifeng > > I found that {{BisectingKMeans}} will generate different models if the input > is cached or not. > Using the same dataset in {{BisectingKMeansSuite}}, we can found that if we > cache the input, then the number of centers will change from 2 to 3. > So it looks like a potential bug. > {code} > import org.apache.spark.ml.param.ParamMap > import org.apache.spark.sql.Dataset > import org.apache.spark.ml.clustering._ > import org.apache.spark.ml.linalg._ > import scala.util.Random > case class TestRow(features: org.apache.spark.ml.linalg.Vector) > val rows = 10 > val dim = 1000 > val seed = 42 > val nnz = 130 > val bkm = new > BisectingKMeans().setK(5).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123) > val random = new Random(seed) > val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, > random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, > Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v)) > val sparseDataset = spark.createDataFrame(rdd) > scala> bkm.fit(sparseDataset).clusterCenters > 17/08/16 17:12:28 WARN BisectingKMeans: The input RDD 579 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res22: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.3081569145071915,0.0,0.0,0.0,0.0,0.1875176493190393,0.0,0.0,0.0,0.33856517726920116,0.0,0.15290274761955236,0.0,0.10820818064086901,0.0,0.0,0.5987249128746422,0.0,0.0,0.3563390364518392,0.0,0.5019914247361699,0.0,0.08711412551574785,0.09199053071837167,0.05749771404790841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5209441786832834,0.0,0.2350595158678447,0.0,0.0,0.0,0.0,0.0,0.0,0.3041334669892575,0.0,0.0,0.32422664760898434,0.0,0.24542718129722224,0.0,0.0,0.06846136418797384,0.0,0.0,0.19556839035017104,0.0,0.0,0.08436120694800427,0.0,0.0,0.0,0.30542501045554465,0.0,0.0,0.0,0.16185204843664616,0.2800921624973247,0.0,0.45459861318444555,0.0,0.0,0.0,0.26222502250076374,0.5235099131919367,0.0,0.0,0.... > scala> bkm.fit(sparseDataset).clusterCenters.length > 17/08/16 17:12:36 WARN BisectingKMeans: The input RDD 667 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res23: Int = 2 > scala> sparseDataset.persist() > res24: sparseDataset.type = [features: vector] > scala> bkm.fit(sparseDataset).clusterCenters > 17/08/16 17:14:35 WARN BisectingKMeans: The input RDD 806 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res26: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1... > scala> bkm.fit(sparseDataset).clusterCenters.length > 17/08/16 17:14:38 WARN BisectingKMeans: The input RDD 855 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res27: Int = 3 > {code} > And suggested by [~srowen], I retest it with the same dataset generated in a > deterministic way, now the results are the same. > {code} > val random = new Random(seed) > val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, > random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, > Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v)) > val vecs = rdd.collect() > val rdd2 = sc.parallelize(vecs) > val sparseDataset2 = spark.createDataFrame(rdd2) > scala> bkm.fit(sparseDataset2).clusterCenters.length > 17/08/16 17:20:36 WARN BisectingKMeans: The input RDD 1114 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res35: Int = 3 > scala> bkm.fit(sparseDataset2).clusterCenters > 17/08/16 17:20:43 WARN BisectingKMeans: The input RDD 1164 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res36: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1... > scala> sparseDataset2.persist() > res37: sparseDataset2.type = [features: vector] > scala> bkm.fit(sparseDataset2).clusterCenters.length > 17/08/16 17:20:54 WARN BisectingKMeans: The input RDD 1216 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res38: Int = 3 > scala> bkm.fit(sparseDataset2).clusterCenters > 17/08/16 17:20:58 WARN BisectingKMeans: The input RDD 1265 is not directly > cached, which may hurt performance if its parent RDDs are also not cached. > res39: Array[org.apache.spark.ml.linalg.Vector] = > Array([0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.562552947957118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.32462454192260704,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.26134237654724357,0.275971592155115,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9124004009677724,0.0,0.0,0.972679942826953,0.0,0.7362815438916668,0.0,0.0,0.20538409256392154,0.0,0.0,0.5867051710505131,0.0,0.0,0.0,0.0,0.0,0.0,0.916275031366634,0.0,0.0,0.0,0.4855561453099385,0.0,0.0,0.0,0.0,0.0,0.0,0.7866750675022912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6178027906951924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.97254915644181,0.0,0.0,0.0,0.0,0.0,0.7947673417631961,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9685267297437855,0.0,0.0,0.0,0.1... > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org