GitHub user Ishiihara opened a pull request:

    https://github.com/apache/spark/pull/1871

    [SPARK-2907] [MLlib] Use mutable.HashMap to represent model in Word2Vec

    Change list:
    1. Use mutable.HashMap to represent syn0Global and syn1Global, which
       reduces shuffle size.
    2. Introduce a local vocabulary to make the learning rate update more
       precise.
    3. Replace layer1Size with vectorSize so that the vector size is set
       correctly. Previously, layer1Size always took the default value of
       vectorSize.

    With 100 partitions, using mutable.HashMap reduces shuffle size from 8.1G
    to 4G.
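
The idea behind item 1 can be illustrated with a minimal Scala sketch. It is not the code from this pull request: the `SparseModelMerge` object, the `SparseModel` alias, and the plain element-wise sum in `merge` are assumptions made for illustration (the commit log mentions a weighted sum in combOp, which is not reproduced here). The point is only that a partition which touches a few words ships a few rows, rather than a dense array covering the whole vocabulary.

```scala
import scala.collection.mutable

object SparseModelMerge {
  // Hypothetical alias: word index -> vector for that word.
  // Only words actually updated in a partition get an entry.
  type SparseModel = mutable.HashMap[Int, Array[Float]]

  /** Merge `other` into `acc`, summing rows that share a word index. */
  def merge(acc: SparseModel, other: SparseModel): SparseModel = {
    other.foreach { case (wordIndex, row) =>
      acc.get(wordIndex) match {
        case Some(existing) =>
          // Both partitions touched this word: add element-wise in place.
          var i = 0
          while (i < existing.length) {
            existing(i) += row(i)
            i += 1
          }
        case None =>
          // Only `other` touched this word: copy the row over.
          acc.update(wordIndex, row.clone())
      }
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val a: SparseModel =
      mutable.HashMap(0 -> Array(1.0f, 2.0f), 3 -> Array(0.5f, 0.5f))
    val b: SparseModel =
      mutable.HashMap(3 -> Array(1.0f, 1.0f), 7 -> Array(4.0f, 4.0f))
    val merged = merge(a, b)
    println(merged.keys.toSeq.sorted.mkString(",")) // 0,3,7
    println(merged(3).mkString(","))                // 1.5,1.5
  }
}
```

A function of this shape could serve as the combine step of an aggregation over partitions; how the real implementation weights or normalizes the merged rows is not shown in this message.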

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark Word2Vec-improve

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1871.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1871
    
----
commit 8d6befe21e26cc843fc96e4c2934a15c0797ce51
Author: Liquan Pei <[email protected]>
Date:   2014-08-01T07:45:22Z

    initial commit

commit 0aafb1b02a19fe4f1689543baf1882a49a7ff11a
Author: Liquan Pei <[email protected]>
Date:   2014-08-01T15:34:11Z

    Add comments, minor fixes

commit e4a04d32be284f9a7ab2d3f57d745342912930a7
Author: Liquan Pei <[email protected]>
Date:   2014-08-01T15:46:38Z

    minor fix

commit 57dc50d3f24beda8eb0348c0baf8dc343065fd2d
Author: Liquan Pei <[email protected]>
Date:   2014-08-01T16:20:10Z

    code formatting

commit 2e92b5991ad8f3f73bbeab9a056f452c4b532b3c
Author: Liquan Pei <[email protected]>
Date:   2014-08-02T01:17:38Z

    modify according to feedback

commit 720b5a3ea697a881fc7d7c286b65ef110421f89e
Author: Liquan Pei <[email protected]>
Date:   2014-08-02T05:53:03Z

    Add test for Word2Vec algorithm, minor fixes

commit 6bcc8be34f6253bc7d4f9d4dcb478bf91f108c86
Author: Liquan Pei <[email protected]>
Date:   2014-08-03T18:15:09Z

    add multiple iteration support

commit 7efbb6f91ca94f9243dbb7a16ea3fc9b6f548b99
Author: Liquan Pei <[email protected]>
Date:   2014-08-03T19:16:19Z

    use broadcast version of vocab in aggregate

commit 1a8fb4127b9433945e75beea16fc2d485a249219
Author: Liquan Pei <[email protected]>
Date:   2014-08-03T23:24:35Z

    use weighted sum in combOp

commit e93e7263d74879379257e6fff40d5efc8417f2ce
Author: Liquan Pei <[email protected]>
Date:   2014-08-04T03:53:21Z

    use treeAggregate instead of aggregate

commit 384c77185544d6f80de96bd366e19760eacbd936
Author: Xiangrui Meng <[email protected]>
Date:   2014-08-04T04:33:05Z

    remove minCount and window from constructor
    change model to use float instead of double

commit c14da411d4da1b6553759afff7952ac746c9fa15
Author: Xiangrui Meng <[email protected]>
Date:   2014-08-04T05:09:58Z

    fix styles

commit 26a948d7e4b8f8cbc91cc7db5cf0acc7d6f08131
Author: Liquan Pei <[email protected]>
Date:   2014-08-04T05:15:27Z

    Merge pull request #1 from mengxr/Ishiihara-master
    
    some updates

commit e2484414d65c3b8aebffa79c3cac34452cf53d38
Author: Liquan Pei <[email protected]>
Date:   2014-08-04T05:47:53Z

    minor style change

commit 2ba948384e96e79e95a529f032d4768f24236547
Author: Liquan Pei <[email protected]>
Date:   2014-08-04T05:59:40Z

    minor fix for Word2Vec test

commit 74b647b3edb87212c57cf6c5e77d627b0aebb67f
Author: Liquan Pei <[email protected]>
Date:   2014-08-07T00:28:53Z

    confict resolution

commit e73fd4c8688cc7bbbf49fa68456fb1c83a29d0e6
Author: Liquan Pei <[email protected]>
Date:   2014-08-10T03:44:15Z

    Merge remote-tracking branch 'upstream/master'

commit a8ccea59e65708d1be708a602369084b90c6fc49
Author: Liquan Pei <[email protected]>
Date:   2014-08-10T04:44:17Z

    use mutable.HashMap to represent model

----

