[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908404#comment-14908404 ]

Zhang, Liye edited comment on SPARK-9103 at 9/25/15 5:56 PM:
-------------------------------------------------------------

Hi [~irashid], thanks for reviewing the doc. 
{quote}
1) Will the proposed design cover SPARK-9111, getting the memory when the 
executor dies abnormally, (esp when killed by yarn)? It seems to me the answer 
is "no", which is fine, that can be tackled separately, I just wanted to 
clarify.
{quote}
You are right, the answer is "no". This design is for phase 1; we can extend it 
later to cover [SPARK-9111|https://issues.apache.org/jira/browse/SPARK-9111].

{quote}
I see the complexity of having overlapping stages, but I wonder if it could be 
simplified somewhat. It seems to me you just need to maintain a 
executorToLatestMetrics: Map[executor, metrics], and then on every stage 
complete, you just log them all?
{quote}
Since we want to reduce the number of events to log, I didn't find a way to 
simplify this for overlapping stages. In the current implementation, we log the 
ExecutorMetrics of all the executors when a stage completes. I think this can 
be simplified by logging only the ExecutorMetrics of the executors related to 
that stage, instead of all the executors. This would greatly reduce the number 
of logged events when many stages run on different executors.
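To make the proposed simplification concrete, here is a minimal sketch, assuming hypothetical names ({{ExecutorMetrics}}, {{stageToExecutors}}, {{onStageComplete}} are illustrative, not actual Spark APIs): keep the latest heartbeat metrics per executor, and on stage completion emit only the metrics of executors that ran tasks for that stage.

```scala
// Hedged sketch: log only the metrics of executors related to the completed
// stage instead of all executors, to cut the number of logged events.
// All names here are illustrative, not Spark's actual listener API.
case class ExecutorMetrics(executorId: String, memoryUsed: Long)

object StageMetricsLogger {
  // latest heartbeat metrics per executor
  private var latestMetrics = Map.empty[String, ExecutorMetrics]
  // which executors ran tasks for each stage
  private var stageToExecutors = Map.empty[Int, Set[String]]

  def onHeartbeat(m: ExecutorMetrics): Unit =
    latestMetrics += (m.executorId -> m)

  def onTaskStart(stageId: Int, executorId: String): Unit =
    stageToExecutors +=
      (stageId -> (stageToExecutors.getOrElse(stageId, Set.empty) + executorId))

  // Return (i.e. "log") only the metrics of executors that ran this stage.
  def onStageComplete(stageId: Int): Seq[ExecutorMetrics] = {
    val related = stageToExecutors.getOrElse(stageId, Set.empty)
    stageToExecutors -= stageId
    related.toSeq.flatMap(latestMetrics.get)
  }
}
```

With many concurrent stages on disjoint executors, each stage-completion event then carries only its own executors' metrics rather than the whole cluster's.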

{quote}
but it seems like there is more state & a bit more logging going on
{quote}
I don't quite understand what you mean by "*more state and more logging going 
on*"; could you explain it further?

{quote}
 I don't fully understand why you need to log both "CHB1" and "HB3" in your 
example.
{quote}
That is because "CHB1" is the combined event and "HB3" is the real event. We 
have to log "HB3" because there might be no heartbeat received for the stage 
after "HB3" (like stage 2 in figure-1 of the doc). That stage would then use 
"HB3" instead of "CHB1", because "CHB1" is not the correct event for it to 
refer to. 
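The fallback rule above can be sketched as follows. This is a minimal illustration, assuming a hypothetical event-log shape (the event classes and {{metricsForStage}} are not Spark's actual event-log format): a stage with no heartbeat of its own falls back to the latest *real* heartbeat at or before its start, never to a combined event.

```scala
// Hedged sketch of why both the combined event ("CHB1") and the last real
// heartbeat ("HB3") must be logged. Event shapes are illustrative only.
sealed trait LoggedEvent { def timestamp: Long }
case class RealHeartbeat(timestamp: Long, memoryUsed: Long) extends LoggedEvent
case class CombinedHeartbeat(timestamp: Long, peakMemory: Long) extends LoggedEvent

// A stage that received no heartbeat of its own uses the latest real
// heartbeat at or before its start; combined events are never a valid
// fallback because they summarize other stages.
def metricsForStage(stageStart: Long, log: Seq[LoggedEvent]): Option[RealHeartbeat] =
  log.collect { case hb: RealHeartbeat if hb.timestamp <= stageStart => hb }
    .sortBy(_.timestamp)
    .lastOption
```

If only "CHB1" were logged, such a stage would have no correct event to refer to, which is why "HB3" must be kept in the log as well.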



> Tracking spark's memory usage
> -----------------------------
>
>                 Key: SPARK-9103
>                 URL: https://issues.apache.org/jira/browse/SPARK-9103
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>            Reporter: Zhang, Liye
>         Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only limited memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> the cause of the problem. So it would be helpful to report detailed memory 
> consumption for each part of Spark, so that users can clearly see where the 
> memory is actually used. 
> The memory usage info to expose should include, but is not limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally enable this functionality, since it is mainly for 
> debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
