Hi Christoph,

Take a look at this, you might end up having a similar case:
http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/

If this is not the case, then I agree with you that KMeans should be
partitioning-agnostic (although I haven't checked the code yet).

Best,
Anastasios

On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke <carabo...@gmail.com> wrote:
> Hi,
>
> I’m trying to figure out how to use KMeans in order to achieve
> reproducible results. I have found that running the same KMeans instance
> on the same data with different partitioning will produce different
> clusterings.
>
> That is, a simple KMeans run with a fixed seed returns different results
> on the same training data if the training data is partitioned differently.
>
> Consider the following example. The same KMeans clustering setup is run
> on identical data. The only difference is the partitioning of the
> training data (one partition vs. four partitions).
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000)
>   .withColumn("a", rand(123))
>   .withColumn("b", rand(321))
>
> val vecAssembler = new VectorAssembler()
>   .setInputCols(Array("a", "b"))
>   .setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with a fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model with different partitionings
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> ```
>
> I get the following costs:
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> What I want to achieve is that repeated runs of the KMeans clustering
> yield identical results on identical training data, regardless of the
> partitioning.
>
> Looking through the Spark source code, I guess the cause is the
> initialization method of KMeans, which in turn uses the `takeSample`
> method, which does not seem to be partition-agnostic.
>
> Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
>
> Best,
> Christoph
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --

--
Anastasios Zouzias <a...@zurich.ibm.com>
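To see why partitioning can change a seeded sample at all, here is a minimal plain-Scala sketch (no Spark; `samplePerPartition` is a hypothetical stand-in for distributed sampling, not Spark's actual `takeSample` implementation). The idea it illustrates: if each partition draws from its own seeded random stream, the same global seed over the same rows yields different samples once the rows are split differently, so k-means can start from different initial centers.

```scala
import scala.util.Random

// Hypothetical per-partition sampler: each partition gets its own RNG
// derived from the global seed plus its partition index.
def samplePerPartition(partitions: Seq[Seq[Int]], seed: Long): Seq[Int] =
  partitions.zipWithIndex.map { case (part, idx) =>
    val rng = new Random(seed + idx)      // stream depends on partition index
    part(rng.nextInt(part.length))        // one seeded pick per partition
  }

val data = (1 to 8).toSeq
val onePartition  = Seq(data)             // all rows in a single partition
val twoPartitions = data.grouped(4).toSeq // same rows, split into two

// Same seed, same rows, different partitioning -> different samples,
// hence potentially different initial cluster centers.
val a = samplePerPartition(onePartition, 42L)
val b = samplePerPartition(twoPartitions, 42L)
println(s"1 partition:  $a")
println(s"2 partitions: $b")
```

Under this assumption, a workaround in the spirit of the linked article is to fix the partitioning (e.g. `repartition(n)` to the same `n`, or cache the repartitioned Dataset and reuse it) before every `fit`, so the sampler always sees the same layout.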