Github user tmnd1991 commented on the pull request:

    https://github.com/apache/spark/pull/9989#issuecomment-159866134
  
    I explained it on the JIRA issue; I'll explain it again here:
    Since `spark.kryoserializer.buffer.max` defaults to 64MB, I decided to 
target a partition size of half that value (32MB) when dividing the model 
into partitions.
    One word2vec entry consists of an array of floats of size `vectorSize` and 
a string. Since the size of the string is variable and considerably smaller than 
the size of the array, I'm not going to consider it in my size computation.
    The number of partitions the model gets split into is given by the 
formula: 
    `(4 * numWords * vectorSize / 33554432) + 1`
    where 4 is the size of a float in bytes, `numWords` and `vectorSize` are 
respectively the number of words the model contains and the size of each 
array, and 33554432 is 32MB in bytes.
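    For illustration, here is a minimal Scala sketch of that computation (not 
the PR's actual code; the function name and the example model sizes are 
made up):

    ```scala
    // Number of partitions so that each holds roughly 32MB of float data,
    // i.e. half of the default spark.kryoserializer.buffer.max (64MB).
    def numPartitions(numWords: Long, vectorSize: Int): Int = {
      val floatSizeBytes = 4L                        // bytes per Float
      val targetPartitionBytes = 32L * 1024 * 1024   // 32MB = 33554432 bytes
      ((floatSizeBytes * numWords * vectorSize) / targetPartitionBytes).toInt + 1
    }

    // Example: a 1,000,000-word model with 100-dimensional vectors holds
    // ~400MB of floats, so it gets split into (400000000 / 33554432) + 1 = 12
    // partitions, each comfortably under the 64MB serializer buffer limit.
    numPartitions(1000000L, 100)  // = 12
    ```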
    A more sophisticated solution would be to read the 
`spark.kryoserializer.buffer.max` value at runtime, but that would be of 
little use: when we're saving the model, we can't be sure the property will 
have the same value when we load it, since loading can happen in a different 
Spark application. 

