[GitHub] spark issue #16355: [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm faili...

imatiach-msft Wed, 28 Dec 2016 08:07:46 -0800

GitHub user imatiach-msft commented on the issue:

    https://github.com/apache/spark/pull/16355
  
    I have very good news :). I was able to reproduce the issue with your
    dataset, and I also verified that the algorithm no longer fails with the
    suggested fix. Adding the line
    val newClusterChildren = children.filter(newClusterCenters.contains(_))
    fixed the issue.
    The error I saw was:
    Job aborted due to stage failure: Task 0 in stage 53.0 failed 1 times,
    most recent failure: Lost task 0.0 in stage 53.0 (TID 105, localhost,
    executor driver): java.util.NoSuchElementException: key not found: 162
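
    A minimal sketch of this failure mode, using plain Scala collections
    rather than the actual Spark internals. The object name, the map contents,
    and the index 161 are hypothetical and chosen only for illustration; the
    missing key 162 and the filter expression are taken from the comment above.

```scala
// Illustrative only: plain Scala collections, not the actual BisectingKMeans code.
object KeyNotFoundSketch {
  // Hypothetical stand-in for the map of new cluster centers, keyed by cluster index.
  val newClusterCenters: Map[Long, Array[Double]] = Map(161L -> Array(0.0, 1.0))

  // Hypothetical candidate child indices; 162 has no entry in the map.
  val children: Seq[Long] = Seq(161L, 162L)

  def main(args: Array[String]): Unit = {
    // Unfiltered lookup: Map.apply throws NoSuchElementException for the missing key.
    val unfiltered = scala.util.Try(children.map(newClusterCenters(_)))
    println(s"unfiltered lookup failed: ${unfiltered.isFailure}")

    // Filtered lookup, mirroring the suggested fix: keep only keys present in the map.
    val newClusterChildren = children.filter(newClusterCenters.contains(_))
    println(s"children with centers: $newClusterChildren")

    // Every remaining lookup is now safe.
    println(s"centers found: ${newClusterChildren.map(newClusterCenters(_)).size}")
  }
}
```

    Filtering before the lookup keeps only the indices that have an entry in
    the map, so a child with no center is skipped instead of raising an error.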
    
    I still need to figure out how to add the test case to Spark, though. My
    understanding is that checking in dataset files is not allowed? My test
    code (not cleaned up yet) was:
    
    import org.apache.spark.mllib.linalg.{
      Vector => OldVector,
      Vectors => OldVectors,
      DenseVector => OldDenseVector,
      SparseVector => OldSparseVector
    }
    
    import org.apache.spark.ml.linalg.{
      Vector => NewVector,
      Vectors => NewVectors,
      DenseVector => NewDenseVector,
      SparseVector => NewSparseVector
    }
    
      @transient var loadedDataset: Dataset[_] = _
    
      override def beforeAll(): Unit = {
        super.beforeAll()
        dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k)
        loadedDataset = spark.createDataFrame(
          sc.textFile("/home/ilya/SampleVectors.txt").map { line =>
            // Drop the first and last character of each line, then parse the
            // rest as an mllib vector and convert it to the matching ml vector.
            val features = OldVectors.parse(line.substring(1, line.length - 1)) match {
              case sv: OldSparseVector => NewVectors.sparse(sv.size, sv.indices, sv.values)
              case dv: OldDenseVector => NewVectors.dense(dv.values)
            }
            Row(features)
          },
          StructType(Seq(StructField("features", SQLDataTypes.VectorType))))
      }
    .....
      test("SPARK-16473: Verify Bisecting K-Means does not fail in edge case " +
        "where no children exist") {
        val bkm = new BisectingKMeans().setK(100).setMinDivisibleClusterSize(4).setMaxIter(4)
    
        assert(bkm.getK === 100)
        assert(bkm.getFeaturesCol === "features")
        assert(bkm.getPredictionCol === "prediction")
        assert(bkm.getMaxIter === 4)
        assert(bkm.getMinDivisibleClusterSize === 4)
        val model = bkm.fit(loadedDataset)
        assert(model.hasSummary)
      }


