GitHub user falaki opened a pull request:
https://github.com/apache/spark/pull/1595
[Core][SPARK-2696] Reduce default value of
spark.serializer.objectStreamReset
The current default value of spark.serializer.objectStreamReset is 10,000.
When re-partitioning a large file (e.g., 500MB) made up of roughly 1MB
records into, say, 64 partitions, the serializer can cache up to
10,000 x 1MB x 64 ~= 640 GB, causing out-of-memory errors.
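
For illustration, a rough sketch of the failing pattern, assuming a
spark-shell session where sc is already bound (the input path and sizes
are hypothetical):

    // Hypothetical 500MB input whose records serialize to roughly 1MB each.
    val records = sc.textFile("hdfs:///path/to/large-file")
    // Repartitioning shuffles every record through the serializer; with
    // objectStreamReset at 10,000, each output stream holds references to
    // up to 10,000 written objects before resetting, so memory use can
    // grow toward 10,000 x 1MB per stream across the 64 partitions.
    records.repartition(64).count()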
This patch lowers the default to a more reasonable value (100).
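
Until the new default lands, the value can also be lowered per
application; a minimal sketch, assuming the standard SparkConf API (the
app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example-app") // placeholder name
      // Reset the serializer's object cache every 100 objects instead of 10,000.
      .set("spark.serializer.objectStreamReset", "100")
    val sc = new SparkContext(conf)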
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/falaki/spark objectStreamReset
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1595.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1595
----
commit 1aa0df87db69d3c814b827e27673b198acf49edb
Author: Hossein <[email protected]>
Date: 2014-07-25T22:56:06Z
Reduce default value of spark.serializer.objectStreamReset
commit 650a935cdd810fe7bbc43555ad126cb2bebaab92
Author: Hossein <[email protected]>
Date: 2014-07-25T23:05:05Z
Updated documentation
----