[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128427#comment-16128427 ]
zhengruifeng commented on SPARK-21742:
--------------------------------------

[~srowen] I set the seed both for generating the dataset and for training the BisectingKMeans, so the results should be deterministic. The only difference lies in whether the input dataset is cached.

> BisectingKMeans generate different models with/without caching
> --------------------------------------------------------------
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on
> whether the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can see that if we
> cache the input, the number of cluster centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
>
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
>
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows)
>   .map(i => Vectors.sparse(dim,
>     random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray,
>     Array.fill(nnz)(random.nextDouble())))
>   .map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
>
> val k = 5
> val bkm = new BisectingKMeans()
>   .setK(k)
>   .setMinDivisibleClusterSize(4)
>   .setMaxIter(4)
>   .setSeed(123)
>
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
>
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
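As an aside, one general way caching can interact with a seeded RNG (not necessarily the root cause in BisectingKMeans itself, and independent of how Spark serializes closures) is that a lazily evaluated collection re-runs its generating function on every traversal, advancing the shared `Random` state each time. A plain-Scala sketch, using only the standard library:

```scala
import scala.util.Random

// A seeded RNG shared by a *lazy* collection: every traversal re-runs the
// mapping function and advances the RNG state, so "the same" dataset can
// yield different values on each pass.
val random = new Random(42)
val lazyData = (1 to 5).view.map(_ => random.nextDouble())

val first  = lazyData.toList   // first materialization: draws 1..5
val second = lazyData.toList   // recomputation: draws 6..10, different values
println(first == second)       // false

// Materializing the data once (the analogue of caching) makes every
// later read identical.
val cachedRandom = new Random(42)
val strictData = (1 to 5).map(_ => cachedRandom.nextDouble())
println(strictData.toList == strictData.toList)  // true
```

If the recomputed input really is bit-for-bit identical across passes, as the comment above argues for this seeded generator, then the remaining suspect is how the algorithm itself consumes the data with and without persistence.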