GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/9803
[SPARK-11813] [MLlib] Avoid serialization of vocab in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11813
I ran into this problem while training on a large corpus. Avoiding
serialization of vocab in Word2Vec has 2 benefits:
1. Better performance, since less data is serialized.
2. A much larger vocabulary capacity for Word2Vec.
Currently, in the fit method of Word2Vec, the serialized closure consists
mainly of the Word2Vec instance and 2 global tables.
The main part of the Word2Vec instance is the vocab, of size
vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20,
that's 160 * vocab bytes.
Their sum cannot exceed Int.MaxValue because of the size limit of
ByteArrayOutputStream. In any case, avoiding serialization of vocab reduces
the size of the serialized closure, especially when vectorSize is small, and
so allows a larger vocabulary.
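
To make the arithmetic above concrete, here is a rough Java sketch that turns
the per-word estimates into a vocabulary ceiling. It takes the figures in this
description at face value (320 bytes per word for vocab, vectorSize * 8 bytes
per word for the 2 tables) and assumes that making vocab transient removes the
whole 320-byte term. The class and method names are hypothetical and are not
part of Spark.

```java
public class ClosureBudget {
    // Per-word estimate for the vocab, copied from the description (not measured).
    static final long VOCAB_BYTES_PER_WORD = 40L * 2 * 4; // 320

    // Two float tables: 2 tables * 4 bytes * vectorSize per word.
    public static long tableBytesPerWord(int vectorSize) {
        return 2L * 4 * vectorSize;
    }

    // Largest vocabulary whose estimated closure still fits in a single
    // byte array, i.e. stays at or below Integer.MAX_VALUE bytes.
    public static long maxVocab(int vectorSize, boolean vocabSerialized) {
        long perWord = tableBytesPerWord(vectorSize)
                + (vocabSerialized ? VOCAB_BYTES_PER_WORD : 0);
        return Integer.MAX_VALUE / perWord;
    }

    public static void main(String[] args) {
        System.out.println(maxVocab(20, true));   // 4473924
        System.out.println(maxVocab(20, false));  // 13421772
    }
}
```

Under these assumptions, with vectorSize = 20 the ceiling moves from roughly
4.5 million to roughly 13.4 million words. The gain shrinks as vectorSize
grows, because the tables then dominate the closure.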
There is another possible fix: make local copies of the fields the closure
uses, so that the Word2Vec instance is not captured in the closure at all.
Let me know if that's preferred.
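
The two options (a transient vocab, as in this pull request, or local copies
of the fields) can be compared with plain Java serialization. The sketch below
is only an illustration of the JVM mechanism. Model, TransientModel and SerFn
are hypothetical stand-ins, not Spark classes, and Spark's Scala closures are
assumed to behave analogously when they read a field through `this`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class ClosureCapture {
    // Serializable function type, standing in for a task closure.
    interface SerFn extends Serializable {
        int apply(int x);
    }

    // Hypothetical stand-in for Word2Vec: one small setting, one large table.
    static class Model implements Serializable {
        int window = 5;
        int[] vocab = new int[100_000];

        // Reads the field through `this`, so the whole Model is captured.
        SerFn capturesThis() {
            return x -> x + window;
        }

        // Copies the field to a local first, so only the int is captured.
        SerFn capturesLocal() {
            final int w = window;
            return x -> x + w;
        }
    }

    // Same shape, but the large table is skipped during serialization.
    static class TransientModel implements Serializable {
        int window = 5;
        transient int[] vocab = new int[100_000];

        SerFn capturesThis() {
            return x -> x + window;
        }
    }

    // Serializes an object in memory and returns the number of bytes written.
    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("captures this:    "
                + serializedSize(new Model().capturesThis()));
        System.out.println("local copy:       "
                + serializedSize(new Model().capturesLocal()));
        System.out.println("transient vocab:  "
                + serializedSize(new TransientModel().capturesThis()));
    }
}
```

In this sketch the first closure drags the 400 KB table along, while the other
two stay small. A transient field is also absent after deserialization, so
that approach only works if the remote side never reads vocab.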
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark w2vVocab
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9803.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9803
----
commit 028138a3d11e5d38fe23d0d286c833fc0f005e5f
Author: Yuhao Yang <[email protected]>
Date: 2015-11-18T09:03:24Z
make vocab transient
----