Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19565#discussion_r146569259
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
    @@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
           docs: RDD[(Long, Vector)],
           lda: LDA): OnlineLDAOptimizer = {
         this.k = lda.getK
    -    this.corpusSize = docs.count()
    +    this.docs = docs.filter(_._2.numNonzeros > 0) // filter out empty documents
    +    this.corpusSize = this.docs.count()
         this.vocabSize = docs.first()._2.size
    --- End diff --
    
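    The `docs.first()` call on the `vocabSize` line needs to be `this.docs` too, so that it reads from the filtered RDD like `corpusSize` does.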
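
    A minimal sketch of the pattern being asked for, using plain Scala collections in place of an RDD so it runs without Spark. The names `FilterEmptyDocsSketch`, `Doc`, `CorpusStats`, `init`, and the local `numNonzeros` helper are hypothetical stand-ins and are not part of the PR; the point is only that every derived value reads from the same filtered collection.

```scala
object FilterEmptyDocsSketch {
  // Hypothetical stand-in for an element of RDD[(Long, Vector)]:
  // a document id paired with its dense term counts.
  type Doc = (Long, Array[Double])

  // Assumed equivalent of Vector.numNonzeros for a dense array.
  def numNonzeros(counts: Array[Double]): Int = counts.count(_ != 0.0)

  final case class CorpusStats(docs: Seq[Doc], corpusSize: Long, vocabSize: Int)

  def init(rawDocs: Seq[Doc]): CorpusStats = {
    // Filter out empty documents once, then use only the filtered value below.
    val docs = rawDocs.filter { case (_, counts) => numNonzeros(counts) > 0 }
    val corpusSize = docs.size.toLong
    // Like first() on an empty RDD, head throws if no documents remain.
    val vocabSize = docs.head._2.length
    CorpusStats(docs, corpusSize, vocabSize)
  }
}
```

    Binding the filtered collection to a single name and never touching the unfiltered input again makes it harder for a later line to mix the two.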

