[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089424#comment-16089424 ]
Peng Meng commented on SPARK-21401:
-----------------------------------

Hi [~srowen], for the ALS optimization, the difference between using poll and sorted is large: with num = 20, the prediction time is about 31s when sorted is used here and about 26s when poll is used. I have run the test many times and the results are about the same. See https://github.com/apache/spark/pull/18624.
I am now testing the effect of changing sorted to poll in LDA.

> add poll function for BoundedPriorityQueue
> ------------------------------------------
>
>                 Key: SPARK-21401
>                 URL: https://issues.apache.org/jira/browse/SPARK-21401
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.3.0
>            Reporter: Peng Meng
>            Priority: Minor
>
> Most usages of BoundedPriorityQueue in ML/MLlib follow the same pattern:
> get the values of the BoundedPriorityQueue, then sort them.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS: pq.toArray.sorted()
> Test results show that using pq.poll() is much faster than sorting the values.
> For example, in PR https://github.com/apache/spark/pull/18624
> we get the sorted values of pq with the following code:
> {quote}
> var size = pq.size
> while (size > 0) {
>   size -= 1
>   val factor = pq.poll
> }{quote}
> If the commonly used pq.toArray.sorted() is used instead to get the sorted
> values of pq, there is about a 10% performance reduction.
> It would be good to add a poll function to BoundedPriorityQueue, since many
> usages of the queue need the sorted values.
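
For illustration, the drain loop quoted in the issue description could look roughly like the sketch below. This is not the code from the PR. `BoundedTopK`, `PollSketch`, `byScore`, and `drainDescending` are hypothetical names, and the class is only a stand-in that assumes a bounded queue backed by a `java.util.PriorityQueue` min-heap that retains the largest `maxSize` elements. Filling the output array from the back, so that the result is in descending order, is also an assumption about how the polled values would be used.

```scala
import java.util.{PriorityQueue => JPriorityQueue}
import scala.reflect.ClassTag

// Hypothetical stand-in for a bounded priority queue: it keeps only the
// `maxSize` largest elements, stored in a min-heap.
final class BoundedTopK[A](maxSize: Int)(implicit ord: Ordering[A]) {
  // scala.math.Ordering extends java.util.Comparator, so it can be passed directly.
  private val underlying = new JPriorityQueue[A](maxSize, ord)

  def size: Int = underlying.size

  def +=(elem: A): this.type = {
    if (underlying.size < maxSize) {
      underlying.offer(elem)
    } else if (ord.gt(elem, underlying.peek())) {
      // The new element beats the current minimum, so replace the minimum.
      underlying.poll()
      underlying.offer(elem)
    }
    this
  }

  // Removes and returns the smallest retained element.
  def poll(): A = underlying.poll()
}

object PollSketch {
  // Orders (id, score) pairs by score only.
  val byScore: Ordering[(Int, Double)] = new Ordering[(Int, Double)] {
    def compare(a: (Int, Double), b: (Int, Double)): Int =
      java.lang.Double.compare(a._2, b._2)
  }

  // Drains the queue with repeated poll() calls. Because poll() yields the
  // smallest element first, writing from the back gives descending order
  // without a separate toArray.sorted pass.
  def drainDescending[A: ClassTag](pq: BoundedTopK[A]): Array[A] = {
    val out = new Array[A](pq.size)
    var i = pq.size
    while (i > 0) {
      i -= 1
      out(i) = pq.poll()
    }
    out
  }
}
```

Note that draining empties the queue, unlike `pq.toArray.sorted`, so this pattern only fits call sites that do not reuse the queue afterwards.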