[ 
https://issues.apache.org/jira/browse/SPARK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11813.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.3
                   1.5.2
                   1.3.2
                   1.4.2
                   1.1.2
                   1.6.0
                   2.0.0

Issue resolved by pull request 9803
[https://github.com/apache/spark/pull/9803]

> Avoid serialization of vocab in Word2Vec
> ----------------------------------------
>
>                 Key: SPARK-11813
>                 URL: https://issues.apache.org/jira/browse/SPARK-11813
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.5.2
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>            Priority: Minor
>             Fix For: 2.0.0, 1.6.0, 1.1.2, 1.4.2, 1.3.2, 1.5.2, 1.2.3
>
>
> Avoid serialization of vocab in Word2Vec. This has 2 benefits:
> 1. Better performance, because less data is serialized.
> 2. Increased capacity of Word2Vec: a larger vocabulary becomes possible.
> Currently, in the fit of Word2Vec, the closure mainly consists of the
> serialized Word2Vec instance and 2 global tables.
> The main part of the Word2Vec instance is the vocab, of size:
> vocab * 40 * 2 * 4 = 320 * vocab bytes;
> the 2 global tables take: vocab * vectorSize * 8 bytes.
> Their sum cannot exceed Int.MaxValue due to the restriction of
> ByteArrayOutputStream. In any case, avoiding serialization of vocab
> decreases the size of the serialized closure, especially when vectorSize is
> small, and thus allows a larger vocabulary.
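
The size estimate above can be turned into a rough calculation of how large a vocabulary fits under the limit. The following is only a sketch of that arithmetic, not code from Spark: the function and constant names are made up for illustration, the per-word figures (320 bytes for vocab, vectorSize * 8 bytes for the 2 global tables) are taken from the description, and everything else in the closure, including serialization overhead, is ignored.

```python
# Rough arithmetic only; names here are hypothetical, not Spark APIs.
INT_MAX = 2**31 - 1  # upper bound on the serialized size, per the description

# Per-word cost of the vocab, as stated in the issue: 40 * 2 * 4 = 320 bytes.
VOCAB_BYTES_PER_WORD = 40 * 2 * 4


def table_bytes_per_word(vector_size):
    """Per-word cost of the 2 global tables: vectorSize * 8 bytes."""
    return vector_size * 8


def max_vocab(vector_size, serialize_vocab):
    """Largest vocabulary whose estimated closure size stays within INT_MAX."""
    per_word = table_bytes_per_word(vector_size)
    if serialize_vocab:
        per_word += VOCAB_BYTES_PER_WORD
    return INT_MAX // per_word


if __name__ == "__main__":
    for vector_size in (10, 100, 300):
        with_vocab = max_vocab(vector_size, serialize_vocab=True)
        without_vocab = max_vocab(vector_size, serialize_vocab=False)
        print(f"vectorSize={vector_size}: "
              f"max vocab with vocab serialized = {with_vocab}, "
              f"without = {without_vocab}")
```

Under these assumptions the gain is largest for small vectorSize, where the fixed 320 bytes per word dominate the per-word cost, which matches the remark in the description.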



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
