Repository: spark
Updated Branches:
  refs/heads/master cf3e9fd84 -> 66f26a461


[SPARK-2696] Reduce default value of spark.serializer.objectStreamReset

The current default value of spark.serializer.objectStreamReset is 10,000.
When re-partitioning (e.g., to 64 partitions) a large file (e.g., 500MB)
containing 1MB records, the serializer can cache up to 10,000 x 1MB x 64 ~=
640 GB, which causes out-of-memory errors.

This patch lowers the default to a more reasonable value (100).
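
To illustrate the mechanism described above, here is a minimal sketch of a
periodic reset on a plain java.io.ObjectOutputStream, together with the
worst-case estimate from the message. It is not the Spark implementation:
ResetSketch, writeAll, and worstCaseBytes are hypothetical names, and the
estimate assumes every open stream holds a full window of cached records.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical illustration, not Spark code.
object ResetSketch {
  // Writes every record to one ObjectOutputStream and calls reset() after
  // each `counterReset` objects so the stream drops its references to them.
  // A value <= 0 disables the periodic reset. Returns the number of resets.
  def writeAll(records: Iterator[Array[Byte]], counterReset: Int): Int = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    var counter = 0
    var resets = 0
    try {
      records.foreach { record =>
        out.writeObject(record)
        counter += 1
        if (counterReset > 0 && counter >= counterReset) {
          out.reset() // cached objects become eligible for garbage collection
          counter = 0
          resets += 1
        }
      }
    } finally out.close()
    resets
  }

  // Upper bound on the bytes kept reachable between resets, assuming each of
  // `streams` open streams caches `counterReset` records of `recordBytes`.
  def worstCaseBytes(counterReset: Int, recordBytes: Long, streams: Int): Long =
    counterReset.toLong * recordBytes * streams

  def main(args: Array[String]): Unit = {
    val records = Iterator.fill(250)(new Array[Byte](1024))
    println(writeAll(records, counterReset = 100))  // prints 2
    println(worstCaseBytes(10000, 1000000L, 64))    // prints 640000000000 (640 GB)
    println(worstCaseBytes(100, 1000000L, 64))      // prints 6400000000 (6.4 GB)
  }
}
```

Under the same assumptions, the new default bounds the cached data at
roughly 6.4 GB across the 64 streams instead of 640 GB.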

Author: Hossein <[email protected]>

Closes #1595 from falaki/objectStreamReset and squashes the following commits:

650a935 [Hossein] Updated documentation
1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66f26a46
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66f26a46
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66f26a46

Branch: refs/heads/master
Commit: 66f26a4610aede57322cb7e193a50aecb6c57d22
Parents: cf3e9fd
Author: Hossein <[email protected]>
Authored: Sat Jul 26 01:04:56 2014 -0700
Committer: Matei Zaharia <[email protected]>
Committed: Sat Jul 26 01:04:56 2014 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/serializer/JavaSerializer.scala  | 2 +-
 docs/configuration.md                                            | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/66f26a46/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
index 0a7e1ec..a7fa057 100644
--- a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
+++ b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
@@ -108,7 +108,7 @@ private[spark] class JavaSerializerInstance(counterReset: Int) extends Serialize
  */
 @DeveloperApi
 class JavaSerializer(conf: SparkConf) extends Serializer with Externalizable {
-  private var counterReset = conf.getInt("spark.serializer.objectStreamReset", 10000)
+  private var counterReset = conf.getInt("spark.serializer.objectStreamReset", 100)
 
   def newInstance(): SerializerInstance = new JavaSerializerInstance(counterReset)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/66f26a46/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index dac8bb1..4e4b781 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -380,13 +380,13 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.serializer.objectStreamReset</code></td>
-  <td>10000</td>
+  <td>100</td>
   <td>
     When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
     objects to prevent writing redundant data, however that stops garbage collection of those
     objects. By calling 'reset' you flush that info from the serializer, and allow old
     objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
-    By default it will reset the serializer every 10,000 objects.
+    By default it will reset the serializer every 100 objects.
   </td>
 </tr>
 <tr>
