GitHub user akopich commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146571987
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -415,7 +415,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
       docs: RDD[(Long, Vector)],
       lda: LDA): OnlineLDAOptimizer = {
     this.k = lda.getK
-    this.corpusSize = docs.count()
+    this.docs = docs.filter(_._2.numNonzeros > 0) // filter out empty documents
+    this.corpusSize = this.docs.count()
     this.vocabSize = docs.first()._2.size
--- End diff --
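`docs` is assumed to be non-empty, whereas `this.docs` may be empty once the
empty documents have been filtered out. In that case `this.docs.first()` would
fail, which is why `vocabSize` is still read from the unfiltered `docs`.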
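
To illustrate the concern, here is a minimal sketch, assuming a local `SparkContext` and a toy corpus in which every document is an all-zero vector. The object name `EmptyCorpusSketch` and the helpers `nonEmptyDocs` and `vocabSizeOption` are hypothetical and not part of the PR; they only mirror the filter from the diff and show one possible way to read the vocabulary size without calling `first()` on an RDD that may be empty.

```scala
import scala.util.Try

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical sketch, not code from the PR.
object EmptyCorpusSketch {

  // Mirrors the filter in the diff: drop documents with no non-zero entries.
  def nonEmptyDocs(docs: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
    docs.filter(_._2.numNonzeros > 0)

  // Reads the vector size from the first document, if there is one.
  // take(1) returns an empty array on an empty RDD instead of throwing.
  def vocabSizeOption(docs: RDD[(Long, Vector)]): Option[Int] =
    docs.take(1).headOption.map(_._2.size)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[1]").setAppName("empty-corpus-sketch")
    val sc = new SparkContext(conf)
    try {
      // Assumed toy corpus: two documents, both without any non-zero term count.
      val docs: RDD[(Long, Vector)] = sc.parallelize(Seq(
        (0L, Vectors.dense(0.0, 0.0, 0.0)),
        (1L, Vectors.sparse(3, Array.empty[Int], Array.empty[Double]))
      ))
      val filtered = nonEmptyDocs(docs)

      // The unfiltered RDD has elements, so first() succeeds here.
      println(s"vocabSize from docs: ${docs.first()._2.size}")

      // The filtered RDD is empty, so first() throws; Try captures the failure.
      println(s"filtered.first() failed: ${Try(filtered.first()).isFailure}")

      // The Option-based helper reports the empty case without throwing.
      println(s"vocabSize from filtered docs: ${vocabSizeOption(filtered)}")
    } finally {
      sc.stop()
    }
  }
}
```

Reading `vocabSize` from the unfiltered `docs`, as the diff does, sidesteps the problem as long as the input corpus itself has at least one document; the `Option`-based helper is only an alternative if that assumption cannot be relied on.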
---