Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22893#discussion_r231555559

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
    @@ -521,19 +521,21 @@ object MLUtils extends Logging {
          * The bound doesn't need the inner product, so we can use it as a sufficient condition to
          * check quickly whether the inner product approach is accurate.
          */
    -    val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    -    if (precisionBound1 < precision) {
    -      sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
    -    } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
    -      val dotValue = dot(v1, v2)
    -      sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
    -      val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
    -        (sqDist + EPSILON)
    -      if (precisionBound2 > precision) {
    -        sqDist = Vectors.sqdist(v1, v2)
    -      }
    -    } else {
    +    if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
           sqDist = Vectors.sqdist(v1, v2)
    +    } else {
    +      val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
    --- End diff --

    Shouldn't you pull computation of things like normDiff into this block, then? That would improve the dense case more, I suppose. I'd still like to see your current final test code to understand what it's testing. If it's a win on some types of vectors, on realistic data, and doesn't hurt performance otherwise, this could be OK.
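To make the suggestion concrete, here is a minimal sketch of the restructuring being discussed: `sumSquaredNorm` and `normDiff` are hoisted into the branch that actually needs them, so the dense-dense fast path pays for neither. This is an illustration, not the committed patch; `FastSqDistSketch`, the local `dot`, and the re-derived `EPSILON` are hypothetical stand-ins for private MLUtils internals (`BLAS.dot` and `MLUtils.EPSILON`), while the branch structure mirrors the diff above.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}

object FastSqDistSketch {

  // Machine epsilon, derived the same way MLUtils computes it internally.
  private val EPSILON: Double = {
    var eps = 1.0
    while ((1.0 + eps / 2.0) != 1.0) eps /= 2.0
    eps
  }

  // Simple stand-in for the private BLAS.dot used inside MLUtils.
  private def dot(v1: Vector, v2: Vector): Double =
    v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum

  /**
   * Squared distance with the dense-dense fast path, where the norm-based
   * bound arithmetic is computed only in the sparse branch that uses it.
   */
  def fastSquaredDistance(
      v1: Vector,
      norm1: Double,
      v2: Vector,
      norm2: Double,
      precision: Double = 1e-6): Double = {
    if (v1.isInstanceOf[DenseVector] && v2.isInstanceOf[DenseVector]) {
      // Dense-dense: go straight to sqdist; no bound computation at all.
      Vectors.sqdist(v1, v2)
    } else {
      // At least one side is sparse: only now compute the quantities the
      // precision bounds need.
      val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
      val normDiff = norm1 - norm2
      val precisionBound1 =
        2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
      if (precisionBound1 < precision) {
        // The cheap inner-product formula is accurate enough.
        sumSquaredNorm - 2.0 * dot(v1, v2)
      } else {
        val dotValue = dot(v1, v2)
        val sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
        val precisionBound2 =
          EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) / (sqDist + EPSILON)
        // Fall back to the exact computation if the bound is too loose.
        if (precisionBound2 > precision) Vectors.sqdist(v1, v2) else sqDist
      }
    }
  }
}
```

Under this arrangement the dense case does strictly less work than before the patch, which is the improvement the comment is pointing at; whether it helps on realistic data is exactly what the requested benchmark/test code would need to show.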