[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089209#comment-16089209 ]
Peng Meng commented on SPARK-21401: ----------------------------------- Hi [~srowen], here we also want to get a fully sorted list by get the the top element N (pq.size) times. In my test, call pq.poll N times is much faster than pq.toArray.sorted. > add poll function for BoundedPriorityQueue > ------------------------------------------ > > Key: SPARK-21401 > URL: https://issues.apache.org/jira/browse/SPARK-21401 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib > Affects Versions: 2.3.0 > Reporter: Peng Meng > Priority: Minor > > The most of BoundedPriorityQueue usages in ML/MLLIB are: > Get the value of BoundedPriorityQueue, then sort it. > For example, in Word2Vec: pq.toSeq.sortBy(-_._2) > in ALS, pq.toArray.sorted() > The test results show using pq.poll() is much faster than sort the value. > For example, in PR https://github.com/apache/spark/pull/18624 > We get the sorted value of pq by the following code: > {quote} > var size = pq.size > while(size > 0) { > size -= 1 > val factor = pq.poll > }{quote} > If using the generally used methods: pq.toArray.sorted() to get the sorted > value of pq. There is about 10% performance reduction. > It is good to add the poll function for BoundedPriorityQueue, since many > usages of PQ need the sorted value. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org