[GitHub] spark pull request: [SPARK-9246] [MLlib] DistributedLDAModel predi...

rotationsymmetry Thu, 30 Jul 2015 22:47:22 -0700

Github user rotationsymmetry commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7769#discussion_r35947957
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
    @@ -361,6 +361,44 @@ class DistributedLDAModel private (
         }
       }
     
    +  /**
    +   * Return the top documents for each topic
    +   *
    +   * This limits the number of documents per topic.
    +   * This is approximate; it may not return exactly the top-weighted documents for each topic.
    +   * To get a more precise set of top documents, increase maxDocumentsPerTopic.
    +   *
    +   * @param maxDocumentsPerTopic  Maximum number of documents to collect for each topic.
    +   * @return  Array over topics.  Each element represent as a pair of matching arrays:
    +   *          (indices for the documents, weights of the topic in these documents).
    +   *          For each topic, documents are sorted in order of decreasing topic weights.
    +   */
    +  def topDocumentsPerTopic(maxDocumentsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
    +    val numTopics = k
    +    val topicsInQueues: Array[BoundedPriorityQueue[(Double, Int)]] =
    +      topicDistributions.mapPartitions { docVertices =>
    +        // For this partition, collect the most common docs for each topic in queues:
    +        //  queues(topic) = queue of (doc weight, doc index).
    +        val queues =
    +          Array.fill(numTopics)(new BoundedPriorityQueue[(Double, Int)](maxDocumentsPerTopic))
    +        for ((docId, docWeight) <- docVertices) {
    --- End diff --
    
    Revised as suggested. Thanks!
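
    For readers following the thread without the full patch, here is a rough,
    Spark-free sketch of the idea the doc comment describes: keep a bounded
    queue of (weight, doc id) per topic within each partition, then merge the
    queues and sort by decreasing weight. This is not the code from the PR.
    The object name TopDocsSketch, the Doc type, the use of Long doc ids, and
    the use of scala.collection.mutable.PriorityQueue in place of Spark's
    BoundedPriorityQueue are all assumptions made for illustration.

```scala
import scala.collection.mutable

object TopDocsSketch {
  // Hypothetical stand-in for one element of a partition:
  // (document id, weight of each topic in that document).
  type Doc = (Long, Array[Double])

  /** For one partition, keep at most maxDocs (weight, docId) pairs per topic. */
  def topPerPartition(docs: Iterator[Doc], numTopics: Int, maxDocs: Int)
      : Array[mutable.PriorityQueue[(Double, Long)]] = {
    // Reversed ordering on weight: the head of each queue is the smallest
    // retained weight, so it is the element evicted when the queue overflows.
    val minFirst = Ordering.by[(Double, Long), Double](_._1).reverse
    val queues =
      Array.fill(numTopics)(mutable.PriorityQueue.empty[(Double, Long)](minFirst))
    for ((docId, weights) <- docs; topic <- 0 until numTopics) {
      val q = queues(topic)
      q.enqueue((weights(topic), docId))
      if (q.size > maxDocs) q.dequeue() // drop the current smallest weight
    }
    queues
  }

  /** Merge the per-partition queues; per topic, return (doc ids, weights)
    * sorted by decreasing weight. */
  def topDocumentsPerTopic(partitions: Seq[Seq[Doc]], numTopics: Int, maxDocs: Int)
      : Array[(Array[Long], Array[Double])] = {
    val perPartition =
      partitions.map(p => topPerPartition(p.iterator, numTopics, maxDocs))
    Array.tabulate(numTopics) { topic =>
      val merged =
        perPartition.flatMap(_(topic).toSeq).sortBy(-_._1).take(maxDocs)
      (merged.map(_._2).toArray, merged.map(_._1).toArray)
    }
  }
}
```

    In this sketch each inner Seq plays the role of one RDD partition, so
    TopDocsSketch.topDocumentsPerTopic(parts, numTopics, maxDocs) can be tried
    on small in-memory data. Bounding each queue keeps per-partition memory
    proportional to numTopics * maxDocs rather than to the number of documents.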



