[ 
https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313947#comment-14313947
 ] 

Matt Cheah edited comment on SPARK-4906 at 2/10/15 10:15 AM:
-------------------------------------------------------------

Spark has logic for failing a stage if there are too many task failures.

Keeping the entire UI state in memory is problematic, however, even without 
stack traces. A large number of jobs accumulating in the master, each with a 
large number of tasks, can bloat the master's heap through UI state alone.

I don't see why we can't make JobProgressListener use a Spillable object or 
something similar to keep some of the UI state on disk. Maybe even maintain the 
state as compressed bytes in memory, if we don't want to deal with the hassles 
of disk spilling?
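
To make the compressed-bytes idea concrete, here is a rough sketch in Python 
rather than Spark's Scala. It is not Spark code: CompressedTaskStore, the 
record fields, and the made-up stack trace are all hypothetical, and it only 
assumes each task's UI record can be serialized to JSON.

```python
import json
import zlib


class CompressedTaskStore:
    """Hypothetical helper: keeps per-task UI records as compressed bytes."""

    def __init__(self):
        self._blobs = {}  # task_id -> zlib-compressed JSON bytes

    def put(self, task_id, record):
        # Serialize once, then keep only the compressed form on the heap.
        raw = json.dumps(record).encode("utf-8")
        self._blobs[task_id] = zlib.compress(raw)

    def get(self, task_id):
        # Decompress on demand, e.g. when a UI page asks for this task.
        raw = zlib.decompress(self._blobs[task_id])
        return json.loads(raw.decode("utf-8"))

    def compressed_size(self):
        return sum(len(blob) for blob in self._blobs.values())


# Made-up, repetitive stack trace standing in for a real error message.
trace = "\n".join(
    f"\tat com.example.Frame{i % 20}.call(Frame.java:{i})" for i in range(250)
)

store = CompressedTaskStore()
for task_id in range(100):
    store.put(task_id, {"taskId": task_id, "errorMessage": trace})

raw_size = 100 * len(trace.encode("utf-8"))
print(raw_size, store.compressed_size())
```

Compressing each record separately keeps lookups cheap, at the cost of not 
sharing redundancy between near-identical stack traces of different tasks.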



> Spark master OOMs with exception stack trace stored in JobProgressListener
> --------------------------------------------------------------------------
>
>                 Key: SPARK-4906
>                 URL: https://issues.apache.org/jira/browse/SPARK-4906
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 1.1.1
>            Reporter: Mingyu Kim
>
> Spark master was OOMing with a lot of stack traces retained in 
> JobProgressListener. The object dependency goes like the following.
> JobProgressListener.stageIdToData => StageUIData.taskData => 
> TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a 
> lot of tasks, when all of the tasks across multiple stages go bad, these 
> error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for 
> running applications is kept in memory, which means a long-running 
> application is almost always bound to OOM. Would it make sense to fix this, 
> for example, by spilling some UI state to disk?
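
As a back-of-the-envelope check on the figures quoted above, the following 
Python snippet estimates how many retained error messages 0.5GB corresponds 
to. It assumes "10kb" means 10 kilobytes and uses binary units; both are 
assumptions, not something the report states.

```python
# Figures from the report; the unit interpretation is an assumption.
message_bytes = 10 * 1024        # ~10 KB per error message
retained_bytes = 0.5 * 1024**3   # 0.5 GB of heap held by error messages

# Number of failed-task error messages needed to reach that heap usage.
failed_tasks = retained_bytes / message_bytes
print(round(failed_tasks))
```

That works out to roughly 50,000 failed tasks, which is plausible for a few 
large stages failing together.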


