srowen commented on a change in pull request #26415: [SPARK-18409][ML] LSH
approxNearestNeighbors should use approxQuantile instead of sort
URL: https://github.com/apache/spark/pull/26415#discussion_r344293861
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
##########
@@ -138,13 +143,13 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
val hashDistCol = hashDistUDF(col($(outputCol)))
// Compute threshold to get exact k elements.
- // TODO: SPARK-18409: Use approxQuantile to get the threshold
- val modelDatasetSortedByHash = modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
- val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
- val hashThreshold = thresholdDataset.take(1).head.getDouble(0)
+ val quantile = numNearestNeighbors.toDouble / modelDataset.count()
+ val modelDatasetWithDist = modelDataset.withColumn(distCol, hashDistUDF(col($(outputCol))))
+ val hashThreshold = modelDatasetWithDist.stat
+ .approxQuantile(distCol, Array(quantile), $(relativeError))
// Filter the dataset where the hash value is less than the threshold.
- modelDataset.filter(hashDistCol <= hashThreshold)
+ modelDatasetWithDist.filter(hashDistCol <= hashThreshold(0))
Review comment:
Hm, I guess it's also possible that we get too few nearest neighbors. This is
probably especially likely when the quantile is small. It may be a good idea to
request more nearest neighbors than needed, so that the likelihood of returning
too few is pretty small.
On that note, is it meaningful to expose relativeError to callers? They want
a given number of nearest neighbors, not more or fewer, and relativeError is a
pretty internal implementation detail. How about simply setting some fixed
value, plus oversampling, which should virtually always give enough results
while still yielding some efficiency gains? I don't know what that value should
be; it might bear a little testing.
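One way the oversampling could look, as a rough sketch rather than a concrete proposal: the approxQuantile documentation states that for a target probability p and error err over N rows, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). Padding the requested quantile by the relative error therefore pushes the lower rank bound up to roughly numNearestNeighbors (ignoring floating-point rounding). The object and method names below are hypothetical and do not exist in Spark.

```scala
// Hypothetical helper, not part of Spark or of this PR.
object OversampledQuantile {

  /**
   * Returns the quantile to request from approxQuantile so that the lower
   * rank bound, floor((q - relativeError) * count), is about k.
   * The result is clamped to 1.0 because a quantile cannot exceed 1.
   */
  def paddedQuantile(k: Int, count: Long, relativeError: Double): Double = {
    require(k > 0, "k must be positive")
    require(count > 0, "count must be positive")
    require(relativeError >= 0.0 && relativeError <= 1.0,
      "relativeError must be in [0, 1]")
    val exactQuantile = k.toDouble / count
    math.min(1.0, exactQuantile + relativeError)
  }
}

// Assumed usage inside approxNearestNeighbors, with a fixed internal error
// (the value 0.05 is a placeholder that would need testing):
//   val q = OversampledQuantile.paddedQuantile(
//     numNearestNeighbors, modelDataset.count(), 0.05)
//   val hashThreshold =
//     modelDatasetWithDist.stat.approxQuantile(distCol, Array(q), 0.05)
```

The trade-off is that the filter may keep up to about 2 * relativeError * count extra candidate rows, so a small fixed error keeps the oversampling cheap while still avoiding the full sort.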