[GitHub] spark pull request #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should fi...

akopich Tue, 24 Oct 2017 07:10:04 -0700

Github user akopich commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146571987
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           docs: RDD[(Long, Vector)],
           lda: LDA): OnlineLDAOptimizer = {
         this.k = lda.getK
    -    this.corpusSize = docs.count()
    +    this.docs = docs.filter(_._2.numNonzeros > 0) // filter out empty documents
    +    this.corpusSize = this.docs.count()
         this.vocabSize = docs.first()._2.size
    --- End diff --
    
    `docs` is assumed to be non-empty, whereas the filtered `this.docs` may be empty, in which case calling `first()` on it would fail.
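
A minimal sketch of the situation, using plain Scala collections as a stand-in for the RDD so that it runs without Spark. The `Doc` alias, the `EmptyDocsSketch` object, and its helper names are hypothetical and not part of the pull request; `Seq.head` plays the role of `RDD.first()`, which likewise throws when the collection is empty.

```scala
import scala.util.Try

object EmptyDocsSketch {
  // Hypothetical stand-in for RDD[(Long, Vector)]: (document id, term counts).
  type Doc = (Long, Array[Double])

  // Mirrors `docs.filter(_._2.numNonzeros > 0)` from the diff:
  // keep only documents with at least one non-zero term count.
  def nonEmpty(docs: Seq[Doc]): Seq[Doc] =
    docs.filter { case (_, counts) => counts.exists(_ != 0.0) }

  // Vocabulary size read from the unfiltered input, as in the diff.
  // Assumes `docs` itself has at least one element.
  def vocabSize(docs: Seq[Doc]): Int = docs.head._2.length

  def main(args: Array[String]): Unit = {
    // A corpus in which every document is empty (all counts are zero).
    val docs: Seq[Doc] = Seq(
      (0L, Array(0.0, 0.0, 0.0)),
      (1L, Array(0.0, 0.0, 0.0))
    )
    val filtered = nonEmpty(docs)

    println(s"corpusSize = ${filtered.size}")
    println(s"vocabSize  = ${vocabSize(docs)}")
    // Taking the first element of the filtered collection throws.
    println(s"filtered.head fails: ${Try(filtered.head).isFailure}")
  }
}
```

With an all-empty corpus the filtered collection has no elements, so only the unfiltered input can safely supply the vector size.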


