[
https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059753#comment-15059753
]
Yanbo Liang edited comment on SPARK-12363 at 12/16/15 9:38 AM:
---------------------------------------------------------------
After I removed [this
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
test cases failed.
I pasted the test cases here. It's very strange that the following test cases
are based on the same dataset, but one success and the other failed.
Another clue is that when use `setInitializationMode("degree")` to train the
PIC model, both the following two cases can pass. But if we use
`setInitializationMode("random")`, the second test case will failed.
{code}
test("power iteration clustering") {
/*
We use the following graph to test PIC. All edges are assigned similarity
1.0 except 0.1 for
edge (3, 4).
15-14 -13 -12
| |
4 . 3 - 2 11
| | x | |
5 0 - 1 10
| |
6 - 7 - 8 - 9
*/
val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0,
3, 1.0), (1, 2, 1.0),
(1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
(4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9,
1.0), (9, 10, 1.0),
(10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
.setK(2)
.run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
val model2 = new PowerIterationClustering()
.setK(2)
.setInitializationMode("degree")
.run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
}
test("power iteration clustering on graph") {
/*
We use the following graph to test PIC. All edges are assigned similarity
1.0 except 0.1 for
edge (3, 4).
15-14 -13 -12
| |
4 . 3 - 2 11
| | x | |
5 0 - 1 10
| |
6 - 7 - 8 - 9
*/
val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0,
3, 1.0), (1, 2, 1.0),
(1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
(4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9,
1.0), (9, 10, 1.0),
(10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val edges = similarities.flatMap { case (i, j, s) =>
if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
} else {
None
}
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)
val model = new PowerIterationClustering()
.setK(2)
.run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
val model2 = new PowerIterationClustering()
.setK(2)
.setInitializationMode("degree")
.run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
}
{code}
was (Author: yanboliang):
After I removed [this
line|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala#L388],
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala#L71]
test cases failed.
It's very strange that the following test cases are the same dataset, but one
success and the other failed.
{code}
test("power iteration clustering") {
/*
We use the following graph to test PIC. All edges are assigned similarity
1.0 except 0.1 for
edge (3, 4).
15-14 -13 -12
| |
4 . 3 - 2 11
| | x | |
5 0 - 1 10
| |
6 - 7 - 8 - 9
*/
val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0,
3, 1.0), (1, 2, 1.0),
(1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
(4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9,
1.0), (9, 10, 1.0),
(10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val model = new PowerIterationClustering()
.setK(2)
.run(sc.parallelize(similarities, 2))
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
val model2 = new PowerIterationClustering()
.setK(2)
.setInitializationMode("degree")
.run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
}
test("power iteration clustering on graph") {
/*
We use the following graph to test PIC. All edges are assigned similarity
1.0 except 0.1 for
edge (3, 4).
15-14 -13 -12
| |
4 . 3 - 2 11
| | x | |
5 0 - 1 10
| |
6 - 7 - 8 - 9
*/
val similarities = Seq[(Long, Long, Double)]((0, 1, 1.0), (0, 2, 1.0), (0,
3, 1.0), (1, 2, 1.0),
(1, 3, 1.0), (2, 3, 1.0), (3, 4, 0.1), // (3, 4) is a weak edge
(4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0), (6, 7, 1.0), (7, 8, 1.0), (8, 9,
1.0), (9, 10, 1.0),
(10, 11, 1.0), (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0))
val edges = similarities.flatMap { case (i, j, s) =>
if (i != j) {
Seq(Edge(i, j, s), Edge(j, i, s))
} else {
None
}
}
val graph = Graph.fromEdges(sc.parallelize(edges, 2), 0.0)
val model = new PowerIterationClustering()
.setK(2)
.run(graph)
val predictions = Array.fill(2)(mutable.Set.empty[Long])
model.assignments.collect().foreach { a =>
predictions(a.cluster) += a.id
}
assert(predictions.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
val model2 = new PowerIterationClustering()
.setK(2)
.setInitializationMode("degree")
.run(sc.parallelize(similarities, 2))
val predictions2 = Array.fill(2)(mutable.Set.empty[Long])
model2.assignments.collect().foreach { a =>
predictions2(a.cluster) += a.id
}
assert(predictions2.toSet == Set((0 to 3).toSet, (4 to 15).toSet))
}
{code}
> PowerIterationClustering test case failed if we deprecated KMeans.setRuns
> -------------------------------------------------------------------------
>
> Key: SPARK-12363
> URL: https://issues.apache.org/jira/browse/SPARK-12363
> Project: Spark
> Issue Type: Bug
> Components: GraphX, MLlib
> Reporter: Yanbo Liang
>
> We plan to deprecated `runs` of KMeans, PowerIterationClustering will
> leverage KMeans to train model.
> I removed `setRuns` used in PowerIterationClustering, but one of the test
> cases failed.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]