GitHub user LeoIV commented on the issue:
https://github.com/apache/spark/pull/18636
At the moment, it is not possible to improve a model's accuracy by
incorporating additional data. I think this should be supported, since it can
increase a classifier's performance significantly. With this implementation, I
was able to run unsupervised training on a Wikipedia dump, which is pretty large.
However, distributing the set is a good point.