[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

zhengruifeng (JIRA) Wed, 16 Aug 2017 02:01:55 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128509#comment-16128509
 ]


zhengruifeng commented on SPARK-21742:
--------------------------------------

[~srowen] you are right. When I create the same dataset in a deterministic way, 
the result is deterministic.
This issue rise in my pervious PR https://github.com/apache/spark/pull/16763, 
when I cache the input, test {{Verify Bisecting K-Means does not fail in edge 
case whereone cluster is empty after split}} in {{BisectingKMeansSuite}} always 
fail, for it generate a different model.
So there is no bug here, and I think it maybe necessary to cache the randomly 
generated datasets in test suites.

> BisectingKMeans generate different models with/without caching
> --------------------------------------------------------------
>
>                 Key: SPARK-21742
>                 URL: https://issues.apache.org/jira/browse/SPARK-21742
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models if the input 
> is cached or not.
> Using the same dataset in {{BisectingKMeansSuite}}, we can found that if we 
> cache the input, then the number of centers will change from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new 
> BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

Reply via email to