雷文昌 created SPARK-15399:
---------------------------
Summary: Wrong equation in the method of org.apache.spark.mllib.clustering.KMeans
Key: SPARK-15399
URL: https://issues.apache.org/jira/browse/SPARK-15399
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.6.1
Environment: windows 64bit
Reporter: 雷文昌
The equation |a-b| = ||a|-|b|| is wrong when a and b are vectors, but it is used
in spark-1.6.1:
private[mllib] def findClosest(
    centers: TraversableOnce[VectorWithNorm],
    point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  var i = 0
  centers.foreach { center =>
    // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
    // distance computation.
    var lowerBoundOfSqDist = center.norm - point.norm
    lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
    if (lowerBoundOfSqDist < bestDistance) {
      val distance: Double = fastSquaredDistance(center, point)
      if (distance < bestDistance) {
        bestDistance = distance
        bestIndex = i
      }
    }
    i += 1
  }
  (bestIndex, bestDistance)
}
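A small worked example may make the two relations concrete. The sketch below is not Spark code: the helper names norm and dist are hypothetical, and the vectors are chosen only for illustration. It evaluates both sides for a = (1, 0) and b = (0, 1), where the equality |a-b| = ||a|-|b|| fails, while the relation written in the source comment, |a-b| >= ||a|-|b||, still holds.

```scala
// Illustrative sketch only; `norm` and `dist` are hypothetical helpers, not Spark APIs.
object NormBoundCheck {
  // Euclidean norm |v| of a dense vector.
  def norm(v: Array[Double]): Double = math.sqrt(v.map(x => x * x).sum)

  // Euclidean distance |a - b| between two vectors of equal length.
  def dist(a: Array[Double], b: Array[Double]): Double =
    norm(a.zip(b).map { case (x, y) => x - y })

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 0.0)
    val b = Array(0.0, 1.0)
    val lhs = dist(a, b)                  // |a - b|, the square root of 2
    val rhs = math.abs(norm(a) - norm(b)) // ||a| - |b||, which is 0 here
    println(s"|a - b| = $lhs, ||a| - |b|| = $rhs")
    assert(lhs != rhs) // the two sides are not equal for these vectors
    assert(lhs >= rhs) // the inequality from the source comment holds
  }
}
```

The two sides coincide only in special cases, for example when a and b point in the same direction.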
The center and the point in the source code are vectors, so I suggest changing
the code to:
private[mllib] def findClosest(
    centers: TraversableOnce[VectorWithNorm],
    point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  var i = 0
  centers.foreach { center =>
    // distance computation.
    val distance: Double = fastSquaredDistance(center, point)
    if (distance < bestDistance) {
      bestDistance = distance
      bestIndex = i
    }
    i += 1
  }
  (bestIndex, bestDistance)
}
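One way to check whether the two variants ever disagree is to run them side by side on the same input. The following is a minimal standalone sketch under several assumptions: VecWithNorm is a hypothetical stand-in for Spark's private VectorWithNorm, squaredDistance is a plain squared Euclidean distance used in place of fastSquaredDistance, and the sample centers and point are made up. A single sample cannot settle the question in general; it only shows how such a comparison could be set up.

```scala
// Standalone sketch; none of these names are Spark APIs.
object FindClosestComparison {
  // Hypothetical stand-in for Spark's private VectorWithNorm.
  final case class VecWithNorm(values: Array[Double]) {
    val norm: Double = math.sqrt(values.map(x => x * x).sum)
  }

  // Plain squared Euclidean distance, assumed here instead of fastSquaredDistance.
  def squaredDistance(a: VecWithNorm, b: VecWithNorm): Double =
    a.values.zip(b.values).map { case (x, y) => (x - y) * (x - y) }.sum

  // Variant with the norm-based check, following the spark-1.6.1 snippet.
  def findClosestPruned(centers: Seq[VecWithNorm], point: VecWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      val diff = center.norm - point.norm
      if (diff * diff < bestDistance) {
        val distance = squaredDistance(center, point)
        if (distance < bestDistance) { bestDistance = distance; bestIndex = i }
      }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  // Variant suggested in this report: compute the distance to every center.
  def findClosestExhaustive(centers: Seq[VecWithNorm], point: VecWithNorm): (Int, Double) = {
    var bestDistance = Double.PositiveInfinity
    var bestIndex = 0
    var i = 0
    centers.foreach { center =>
      val distance = squaredDistance(center, point)
      if (distance < bestDistance) { bestDistance = distance; bestIndex = i }
      i += 1
    }
    (bestIndex, bestDistance)
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data for illustration.
    val centers = Seq(
      VecWithNorm(Array(0.0, 0.0)),
      VecWithNorm(Array(10.0, 0.0)),
      VecWithNorm(Array(0.0, 3.0)))
    val point = VecWithNorm(Array(1.0, 1.0))
    println(s"pruned:     ${findClosestPruned(centers, point)}")
    println(s"exhaustive: ${findClosestExhaustive(centers, point)}")
  }
}
```

Running the same comparison over a larger set of points, including any input on which the clustering result looks wrong, would give a more meaningful check than this single sample.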