[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358958#comment-14358958 ]
Ilya Ganelin commented on SPARK-4927:
-------------------------------------

Hi Sean - I have a code snippet that reproduces this. Let me send it to you in a bit - I don't have the means to run 1.3 on a cluster.

> Spark does not clean up properly during long jobs.
> ---------------------------------------------------
>
>                 Key: SPARK-4927
>                 URL: https://issues.apache.org/jira/browse/SPARK-4927
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Ilya Ganelin
>
> On a long-running Spark job, Spark will eventually run out of memory on the
> driver node due to metadata overhead from the shuffle operation. Spark will
> continue to operate, but with drastically decreased performance (since
> swapping now occurs with every operation).
> The spark.cleaner.ttl parameter lets a user configure when cleanup happens,
> but that cleanup is not done safely: if it clears a cached RDD or an active
> task in the middle of processing a stage, the next stage ultimately fails
> with a KeyNotFoundException when it attempts to reference the cleared RDD
> or task.
> There should be a sustainable mechanism for cleaning up stale metadata that
> allows the program to continue running.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
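The failure mode the issue describes can be sketched outside of Spark. The following is a minimal Python illustration (not Spark code; `TtlMetadataStore`, `shuffle_0`, and all values are hypothetical): a store that evicts entries purely by age, the way a TTL-based cleaner does, will delete metadata that a still-running stage needs, and the next lookup fails with a KeyError, the analogue of the KeyNotFoundException reported above.

```python
import threading
import time

class TtlMetadataStore:
    """Hypothetical age-based metadata store illustrating unsafe TTL cleanup."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}          # key -> (value, insertion time)
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic())

    def get(self, key):
        with self._lock:
            value, _ = self._data[key]   # raises KeyError if already evicted
            return value

    def cleanup(self):
        # Eviction keyed only on age: it ignores whether any in-flight
        # "stage" still references the entry.
        now = time.monotonic()
        with self._lock:
            expired = [k for k, (_, t) in self._data.items()
                       if now - t > self.ttl]
            for key in expired:
                del self._data[key]

store = TtlMetadataStore(ttl_seconds=0.05)
store.put("shuffle_0", "map output locations")
time.sleep(0.1)      # a long-running stage outlives the TTL
store.cleanup()      # the cleaner evicts metadata that is still in use
try:
    store.get("shuffle_0")
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True: the next stage can no longer find its metadata
```

A safe cleaner would instead track liveness (e.g. reference counts or weak references from registered RDDs and stages) and evict only entries that nothing can reach anymore, which is the direction the issue's last paragraph points toward.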