srowen commented on a change in pull request #26415: [SPARK-18409][ML] LSH
approxNearestNeighbors should use approxQuantile instead of sort
URL: https://github.com/apache/spark/pull/26415#discussion_r348886172
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
##########
@@ -112,7 +112,9 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
numNearestNeighbors: Int,
singleProbe: Boolean,
distCol: String): Dataset[_] = {
- require(numNearestNeighbors > 0, "The number of nearest neighbors cannot
be less than 1")
+ val count = dataset.count()
Review comment:
Fair point, but it's hardly the most expensive operation here, before or
after. I'd expect a caller has to cache the input for multiple calls to make
sense anyway. I think the count + approxQuantile still beats sort + take.
I also take the point about take() but relaxing the requirement won't remove
the need for count() here. Does it make more sense semantically? You're right
it would be more consistent with the original implementation. Hm. I guess I'm
neutral on it, but it's a valid question.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]