Hi Christoph,

Take a look at this; you might have a similar case:

http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/

If this is not the case, then I agree with you that KMeans should be
partition-agnostic (although I haven't checked the code yet).
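
The gist of that post: a non-deterministic source such as `rand` may be
re-evaluated on every action unless you cache it, so later actions might not
even see the same rows. A rough spark-shell sketch of what I mean, reusing the
names from your snippet (just an illustration, I haven't run it against your
setup):

```
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.rand

// same data generation as in your snippet
val randomData = spark.range(1, 1000).withColumn("a", rand(123)).withColumn("b", rand(321))
val vecAssembler = new VectorAssembler().setInputCols(Array("a", "b")).setOutputCol("features")

// pin the generated rows in memory and materialize them once,
// so that every later fit is guaranteed to see exactly the same data
val data = vecAssembler.transform(randomData).cache()
data.count()

// ... then run your two kmeans.fit(...) calls on `data` as before
```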

Best,
Anastasios


On Mon, May 22, 2017 at 3:42 PM, Christoph Bruecke <carabo...@gmail.com>
wrote:

> Hi,
>
> I’m trying to figure out how to use KMeans in order to achieve
> reproducible results. I have found that running the same KMeans instance on
> the same data, but with different partitioning, produces different
> clusterings.
>
> In other words, a simple KMeans run with a fixed seed returns different
> results on the same training data if the training data is partitioned
> differently.
>
> Consider the following example. The same KMeans clustering setup is run on
> identical data. The only difference is the partitioning of the training data
> (one partition vs. four partitions).
>
> ```
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.rand
> import org.apache.spark.ml.clustering.KMeans
> import org.apache.spark.ml.feature.VectorAssembler
>
> // generate random data for clustering
> val randomData = spark.range(1, 1000)
>   .withColumn("a", rand(123))
>   .withColumn("b", rand(321))
>
> // assemble the two random columns into a single feature vector
> val vecAssembler = new VectorAssembler()
>   .setInputCols(Array("a", "b"))
>   .setOutputCol("features")
>
> val data = vecAssembler.transform(randomData)
>
> // instantiate KMeans with a fixed seed
> val kmeans = new KMeans().setK(10).setSeed(9876L)
>
> // train the model on the same data with different partitionings
> val dataWith1Partition = data.repartition(1)
> println("1 Partition: " + kmeans.fit(dataWith1Partition).computeCost(dataWith1Partition))
>
> val dataWith4Partition = data.repartition(4)
> println("4 Partition: " + kmeans.fit(dataWith4Partition).computeCost(dataWith4Partition))
> ```
>
> I get the following costs:
>
> ```
> 1 Partition: 16.028212597888057
> 4 Partition: 16.14758460544976
> ```
>
> What I want to achieve is that repeated computations of the KMeans
> clustering yield identical results on identical training data,
> regardless of the partitioning.
>
> Looking through the Spark source code, I guess the cause is the
> initialization method of KMeans, which in turn uses the `takeSample` method,
> which does not seem to be partition-agnostic.
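>
> A quick spark-shell check of that suspicion (just a toy illustration, not
> the exact call KMeans makes internally): sampling with a fixed seed can
> pick different elements depending on how the RDD is partitioned.
>
> ```
> // same elements, same seed, only the number of partitions differs
> val rdd1 = sc.parallelize(1 to 1000, 1)
> val rdd4 = sc.parallelize(1 to 1000, 4)
>
> // the two samples will typically differ
> println(rdd1.takeSample(false, 5, 9876L).mkString(", "))
> println(rdd4.takeSample(false, 5, 9876L).mkString(", "))
> ```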
>
> Is this behaviour expected? Is there anything I could do to achieve
> reproducible results?
>
> Best,
> Christoph
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-- Anastasios Zouzias
<a...@zurich.ibm.com>
