GitHub user rotationsymmetry commented on a diff in the pull request:
https://github.com/apache/spark/pull/7769#discussion_r35947957
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -361,6 +361,44 @@ class DistributedLDAModel private (
}
}
+ /**
+ * Return the top documents for each topic
+ *
+ * This limits the number of documents per topic.
+ * This is approximate; it may not return exactly the top-weighted documents for each topic.
+ * To get a more precise set of top documents, increase maxDocumentsPerTopic.
+ *
+ * @param maxDocumentsPerTopic Maximum number of documents to collect for each topic.
+ * @return Array over topics. Each element is represented as a pair of matching arrays:
+ * (indices for the documents, weights of the topic in these documents).
+ * For each topic, documents are sorted in order of decreasing topic weights.
+ */
+ def topDocumentsPerTopic(maxDocumentsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
+ val numTopics = k
+ val topicsInQueues: Array[BoundedPriorityQueue[(Double, Int)]] =
+ topicDistributions.mapPartitions { docVertices =>
+ // For this partition, collect the most common docs for each topic in queues:
+ // queues(topic) = queue of (doc weight, doc index).
+ val queues =
+ Array.fill(numTopics)(new BoundedPriorityQueue[(Double, Int)](maxDocumentsPerTopic))
+ for ((docId, docWeight) <- docVertices) {
--- End diff --
Revised as suggested. Thanks!
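
For readers following the diff, here is a rough, Spark-free sketch of the idea the
doc comment describes: keep one bounded queue per topic within each partition, merge
the queues across partitions, then sort by decreasing weight. Everything in it is an
assumption made for illustration: the names TopDocsSketch, BoundedQueue and Doc are
hypothetical, partitions are modelled as plain Seqs rather than an RDD, document IDs
are Longs, and BoundedQueue only stands in for Spark's BoundedPriorityQueue. It is not
the code from the pull request, and it does not model the approximation that the doc
comment mentions.

```scala
import scala.collection.mutable

object TopDocsSketch {
  // Hypothetical stand-in for one document: (document ID, per-topic weights).
  type Doc = (Long, Array[Double])

  // Keeps only the `maxSize` largest (weight, docId) pairs.
  final class BoundedQueue(maxSize: Int) {
    // Inverted comparison: the head of the heap is the smallest kept weight.
    private val smallestFirst: Ordering[(Double, Long)] =
      new Ordering[(Double, Long)] {
        def compare(a: (Double, Long), b: (Double, Long)): Int =
          java.lang.Double.compare(b._1, a._1)
      }
    private val heap = mutable.PriorityQueue.empty[(Double, Long)](smallestFirst)

    def add(elem: (Double, Long)): Unit = {
      heap.enqueue(elem)
      if (heap.size > maxSize) heap.dequeue() // evict the current minimum
    }

    def addAll(other: BoundedQueue): Unit = other.heap.foreach(add)

    def sortedDescending: Array[(Double, Long)] =
      heap.toArray.sortWith(_._1 > _._1)
  }

  def topDocumentsPerTopic(
      partitions: Seq[Seq[Doc]],
      numTopics: Int,
      maxDocumentsPerTopic: Int): Array[(Array[Long], Array[Double])] = {
    // Step 1: within each partition, fill one bounded queue per topic.
    val perPartition: Seq[Array[BoundedQueue]] = partitions.map { docs =>
      val queues = Array.fill(numTopics)(new BoundedQueue(maxDocumentsPerTopic))
      for ((docId, weights) <- docs; topic <- 0 until numTopics) {
        queues(topic).add((weights(topic), docId))
      }
      queues
    }

    // Step 2: merge the per-partition queues topic by topic.
    val merged = Array.fill(numTopics)(new BoundedQueue(maxDocumentsPerTopic))
    for (queues <- perPartition; topic <- 0 until numTopics) {
      merged(topic).addAll(queues(topic))
    }

    // Step 3: per topic, emit matching arrays sorted by decreasing weight.
    merged.map { queue =>
      val sorted = queue.sortedDescending
      (sorted.map(_._2), sorted.map(_._1))
    }
  }
}
```

Because each queue holds at most maxDocumentsPerTopic entries, the memory used per
partition in this sketch is bounded by numTopics * maxDocumentsPerTopic regardless of
how many documents the partition contains.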