[GitHub] [spark] srowen commented on a change in pull request #26415: [SPARK-18409][ML] LSH approxNearestNeighbors should use approxQuantile instead of sort

GitBox Wed, 20 Nov 2019 19:53:55 -0800

srowen commented on a change in pull request #26415: [SPARK-18409][ML] LSH 
approxNearestNeighbors should use approxQuantile instead of sort
URL: https://github.com/apache/spark/pull/26415#discussion_r348886172


 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
 ##########
 @@ -112,7 +112,9 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
       numNearestNeighbors: Int,
       singleProbe: Boolean,
       distCol: String): Dataset[_] = {
-    require(numNearestNeighbors > 0, "The number of nearest neighbors cannot 
be less than 1")
+    val count = dataset.count()
 
 Review comment:
   Fair point, but it's hardly the most expensive operation here, before or 
after. I'd expect a caller has to cache the input for multiple calls to make 
sense anyway. I think the count + approxQuantile still beats sort + take.
   
   I also take the point about take() but relaxing the requirement won't remove 
the need for count() here. Does it make more sense semantically? You're right 
it would be more consistent with the original implementation. Hm. I guess I'm 
neutral on it, but it's a valid question.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] srowen commented on a change in pull request #26415: [SPARK-18409][ML] LSH approxNearestNeighbors should use approxQuantile instead of sort

Reply via email to