spark git commit: [SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec

meng Wed, 18 Nov 2015 13:26:28 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-1.6 23b8c2256 -> 18e308b84



[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization 
of vocab in Word2Vec has 2 benefits:
1. Better performance, since less data is serialized.
2. A much larger vocabulary capacity for Word2Vec.
Currently, in the fit of Word2Vec, the closure mainly consists of the 
serialized Word2Vec instance and 2 global tables:
the main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes;
the 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of 
ByteArrayOutputStream. In any case, avoiding serialization of vocab reduces 
the size of the serialized closure, especially when vectorSize is small, and 
thus allows a larger vocabulary.
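
As a rough illustration of the arithmetic above, the sketch below computes the 
largest vocabulary that fits under the Int.MaxValue byte limit, with and 
without the vocab in the closure. The object and method names are hypothetical 
and not part of Spark; the per-word costs are the estimates quoted in this 
message.

```scala
object ClosureSizeEstimate {
  // Per-word cost of the vocab, as estimated above: 40 * 2 * 4 = 320 bytes.
  val VocabBytesPerWord: Long = 40L * 2 * 4

  // Per-word cost of the 2 global tables, as estimated above: vectorSize * 8 bytes.
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose estimated closure size stays within Int.MaxValue bytes.
  def maxVocab(vectorSize: Int, includeVocab: Boolean): Long = {
    val vocabCost = if (includeVocab) VocabBytesPerWord else 0L
    Int.MaxValue.toLong / (tableBytesPerWord(vectorSize) + vocabCost)
  }

  def main(args: Array[String]): Unit = {
    val vectorSize = 20
    println(s"vocab serialized:     ${maxVocab(vectorSize, includeVocab = true)} words")
    println(s"vocab not serialized: ${maxVocab(vectorSize, includeVocab = false)} words")
  }
}
```

Under these estimates, with vectorSize = 20 the per-word cost drops from 480 to 
160 bytes, so the vocabulary limit roughly triples.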

Actually, there's another possible fix: make local copies of the fields to 
avoid including Word2Vec in the closure. Let me know if that's preferred.
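
The two approaches can be compared with a minimal sketch. It uses plain Java 
serialization as an approximation of Spark's closure serialization, and the 
classes below are hypothetical stand-ins for Word2Vec, not the actual 
implementation.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for Word2Vec: a large vocab field plus one small parameter.
class PlainTrainer(n: Int, val vectorSize: Int) extends Serializable {
  private val vocab: Array[Int] = Array.fill(n)(0)
  def vocabLength: Int = vocab.length

  // Reads a field, so the closure captures `this` and drags vocab along.
  def viaThis: Int => Int = x => x * vectorSize

  // Alternative fix: copy the field to a local so only an Int is captured.
  def viaLocal: Int => Int = {
    val localVectorSize = vectorSize
    x => x * localVectorSize
  }
}

// Same shape, but vocab is marked @transient as in this commit.
class TransientTrainer(n: Int, val vectorSize: Int) extends Serializable {
  @transient private val vocab: Array[Int] = Array.fill(n)(0)
  def vocabLength: Int = vocab.length

  // Still captures `this`, but Java serialization skips the transient field.
  def viaThis: Int => Int = x => x * vectorSize
}

object ClosureCaptureDemo {
  // Number of bytes produced by Java-serializing the given object.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 100000
    val plain = new PlainTrainer(n, 20)
    val transientTrainer = new TransientTrainer(n, 20)
    println(s"plain field, closure over this:     ${serializedSize(plain.viaThis)} bytes")
    println(s"plain field, closure over local:    ${serializedSize(plain.viaLocal)} bytes")
    println(s"transient field, closure over this: ${serializedSize(transientTrainer.viaThis)} bytes")
  }
}
```

In this sketch only the first closure carries the full array. Note that a 
@transient field is null after deserialization, so it must not be read inside 
the closure.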

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abdf2cb6098a35347bd123b815ee9ac5b689)
Signed-off-by: Xiangrui Meng <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18e308b8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18e308b8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18e308b8

Branch: refs/heads/branch-1.6
Commit: 18e308b84fe7ffeca730397152582b31a4b88a82
Parents: 23b8c22
Author: Yuhao Yang <[email protected]>
Authored: Wed Nov 18 13:25:15 2015 -0800
Committer: Xiangrui Meng <[email protected]>
Committed: Wed Nov 18 13:25:22 2015 -0800

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/18e308b8/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index f3e4d34..7ab0d89 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -145,8 +145,8 @@ class Word2Vec extends Serializable with Logging {
 
   private var trainWordsCount = 0
   private var vocabSize = 0
-  private var vocab: Array[VocabWord] = null
-  private var vocabHash = mutable.HashMap.empty[String, Int]
+  @transient private var vocab: Array[VocabWord] = null
+  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
 
   private def learnVocab(words: RDD[String]): Unit = {
     vocab = words.map(w => (w, 1))

