Github user mridulm commented on a diff in the pull request:

    https://github.com/apache/spark/pull/89#discussion_r10559136

--- Diff: docs/configuration.md ---
@@ -487,6 +477,88 @@ Apart from these, the following properties are also available, and may be useful
     </tr>
 </table>
+
+The following properties can be used to schedule cleanup jobs at different levels.
+These metadata tuning parameters should be set with great care and only where required,
+since scheduling metadata cleanup in the middle of a job can result in a lot of unnecessary re-computation.
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.cleaner.ttl</td>
+  <td>(infinite)</td>
+  <td>
+    Duration (seconds) for which Spark will remember any metadata (stages generated, tasks generated, etc.).
+    Periodic cleanups ensure that metadata older than this duration is forgotten. This is
+    useful when running Spark for many hours / days (for example, running 24/7 in the case of Spark Streaming
+    applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
+  </td>
+</tr>
+<tr>
+  <td>spark.cleaner.ttl.MAP_OUTPUT_TRACKER</td>
+  <td>spark.cleaner.ttl, with a min. value of 10 secs</td>
+  <td>
+    Cleans up the map containing the mapper information (the input block manager Id and the output result size) corresponding to a shuffle Id.
+  </td>
--- End diff --

you might want to add that this takes precedence over spark.cleaner.ttl
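As an aside for readers of this thread, here is a minimal sketch of how these TTL settings might be applied through SparkConf. The spark.cleaner.ttl.MAP_OUTPUT_TRACKER key is the one proposed in this diff, and the app name, master, and second values below are illustrative assumptions rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: enabling periodic metadata cleanup for a long-running application.
val conf = new SparkConf()
  .setAppName("long-running-app")   // hypothetical app name
  .setMaster("local[*]")            // illustrative; normally set via spark-submit
  // Forget stage/task metadata older than one hour (value is in seconds).
  // Persisted RDDs older than this TTL are cleared as well, so choose a
  // duration longer than any window over which cached data is reused.
  .set("spark.cleaner.ttl", "3600")
  // Key proposed in this diff; per the review comment above, a tracker-specific
  // TTL would take precedence over spark.cleaner.ttl for the map-output tracker.
  .set("spark.cleaner.ttl.MAP_OUTPUT_TRACKER", "600")

val sc = new SparkContext(conf)
```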