GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/9803

    [SPARK-11813] [MLlib] Avoid serialization of vocab in Word2Vec

    jira: https://issues.apache.org/jira/browse/SPARK-11813 
    
    I found this problem while training on a large corpus. Avoiding
    serialization of vocab in Word2Vec has 2 benefits:
    1. Better performance, since less data is serialized.
    2. A much larger vocabulary capacity for Word2Vec.
    Currently, in Word2Vec's fit, the serialized closure mainly consists of
    the Word2Vec instance and 2 global tables.
    The main part of Word2Vec is the vocab, of size
    vocab * 40 * 2 * 4 = 320 * vocab bytes.
    The 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20,
    that's 160 * vocab bytes.
    
    Their sum cannot exceed Int.MaxValue bytes because of the size limit of
    ByteArrayOutputStream. In any case, avoiding serialization of vocab
    shrinks the serialized closure, especially when vectorSize is small, and
    thus allows a larger vocabulary.
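
    The arithmetic above can be sketched as a small estimate. This is only an
    illustration: the object and method names are hypothetical, the per-word
    byte counts are the approximations quoted in this description, and real
    serialized sizes will include extra overhead.

```scala
object ClosureSizeEstimate {
  // Assumed constants, taken from the figures in this description.
  val MaxCodeLength = 40 // entries in each per-word Int array
  val IntBytes = 4L
  val FloatBytes = 4L

  /** Approximate vocab bytes per word: two Int arrays of length 40. */
  val vocabBytesPerWord: Long = MaxCodeLength * 2 * IntBytes

  /** Approximate bytes per word for the two global Float tables. */
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize * 2 * FloatBytes

  /** Largest vocabulary whose estimated payload fits in Int.MaxValue bytes. */
  def maxVocab(vectorSize: Int, serializeVocab: Boolean): Long = {
    val vocabPart = if (serializeVocab) vocabBytesPerWord else 0L
    Int.MaxValue.toLong / (tableBytesPerWord(vectorSize) + vocabPart)
  }
}
```

    Under these assumptions, with vectorSize = 20 the estimated ceiling rises
    from about 4.47 million words (480 bytes per word) to about 13.4 million
    words (160 bytes per word) once vocab is left out of the closure.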
    
    There is another possible fix: make local copies of the needed fields to
    avoid including Word2Vec in the closure. Let me know if that's preferred.
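
    For reference, the effect of the approach in this PR's commit (marking the
    field transient) can be illustrated with plain Java serialization. The
    classes below are hypothetical stand-ins, not the MLlib Word2Vec classes,
    and the array size is arbitrary.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical model that serializes its vocab array along with itself.
class ModelWithVocab(val vocab: Array[Int]) extends Serializable

// Same shape, but @transient makes Java serialization skip the vocab field.
class ModelWithTransientVocab(@transient val vocab: Array[Int]) extends Serializable

object SerializedSize {
  /** Number of bytes Java serialization writes for `obj`. */
  def of(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size()
  }
}

object TransientDemo {
  // 100000 Ints, i.e. roughly 400 KB of payload when serialized.
  val vocab: Array[Int] = new Array[Int](100000)
  val full: Int = SerializedSize.of(new ModelWithVocab(vocab))
  val trimmed: Int = SerializedSize.of(new ModelWithTransientVocab(vocab))
}
```

    Here TransientDemo.full includes the whole array, while
    TransientDemo.trimmed contains little more than the class descriptor. A
    transient field comes back as null after deserialization, so this only
    works if the deserialized copy never reads it.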

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark w2vVocab

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9803.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9803
    
----
commit 028138a3d11e5d38fe23d0d286c833fc0f005e5f
Author: Yuhao Yang <[email protected]>
Date:   2015-11-18T09:03:24Z

    make vocab transient

----


