GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/9878

    [SPARK-11898] [MLlib] Use broadcast for the global tables in Word2Vec

    jira: https://issues.apache.org/jira/browse/SPARK-11898 
    syn0Global and syn1Global in Word2Vec are quite large objects, with size
    (vocab * vectorSize * 8), yet they are passed to the workers using basic
    task serialization.
    
    Using broadcast can greatly improve performance. My benchmark shows that,
    for a 1M vocabulary and the default vectorSize of 100, changing to
    broadcast can help:
    
    1. decrease the worker memory consumption by 45%.
    2. decrease the running time by 40%.
    
    This will also help extend the upper limit for Word2Vec.
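
As a rough sanity check on the size formula quoted in the description, the sketch below evaluates vocab * vectorSize * 8 for the benchmark configuration (1M vocabulary, vectorSize 100). The function name `global_table_bytes` and the reading of the result as a byte count are assumptions made for illustration; they are not taken from the patch.

```python
def global_table_bytes(vocab_size: int, vector_size: int, bytes_per_entry: int = 8) -> int:
    """Evaluate the size formula from the description: vocab * vectorSize * 8.

    Assumption: the result is a number of bytes. The factor 8 is taken
    verbatim from the description and exposed as a parameter only so that
    other values can be tried.
    """
    return vocab_size * vector_size * bytes_per_entry


if __name__ == "__main__":
    # Benchmark configuration from the description: 1M vocabulary, vectorSize 100.
    size = global_table_bytes(1_000_000, 100)
    print(f"{size} bytes = {size / 10**6:.0f} MB = {size / 2**20:.1f} MiB")
```

Under these assumptions the benchmark configuration comes to roughly 800 MB, which gives a sense of how much data is involved when the tables are sent to the workers.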

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark w2vBC

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9878
    
----
commit cee80c04bc709d9e4730ab1cad69b9b075738a75
Author: Yuhao Yang <[email protected]>
Date:   2015-11-21T03:34:52Z

    broadcast global table

----

