Marcelo Masiero Vanzin created SPARK-29562:
----------------------------------------------

             Summary: SQLAppStatusListener metrics aggregation is slow and 
memory hungry
                 Key: SPARK-29562
                 URL: https://issues.apache.org/jira/browse/SPARK-29562
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Marcelo Masiero Vanzin


While {{SQLAppStatusListener}} was added in 2.3, the aggregation code is very 
similar to what it was previously, so I'm sure this problem is even older.

Long story short, the aggregation code 
({{SQLAppStatusListener.aggregateMetrics}}) is very slow: it can take a 
non-trivial amount of time with large queries, and it uses a lot of memory.

There are also cascading issues caused by that: since it's called from an event 
handler, it can slow down event processing, causing events to be dropped, which 
can cause listeners to miss important events that would tell them to free up 
internal state (and, thus, memory).

To give an anecdotal example, one app I looked at ran into the "events being 
dropped" issue, which caused the listener to accumulate state for hundreds of 
live stages, even though most of them were already finished. That led to a few 
GB of memory being wasted on finished stages that were still being tracked.
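
As a rough illustration of the cascade described above, here is a toy model in 
Python. It is not Spark code: {{BoundedBus}}, {{StageTracker}} and 
{{simulate}} are hypothetical names, and the only assumptions are a 
fixed-capacity queue that drops events when full and a listener that frees a 
stage's state only when it sees that stage's "end" event.

```python
from collections import deque


class BoundedBus:
    """Toy event bus (hypothetical): posts are dropped when the queue is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def post(self, event):
        if len(self.queue) >= self.capacity:
            self.dropped += 1  # the listener never sees this event
            return False
        self.queue.append(event)
        return True

    def drain(self, listener, max_events):
        """Deliver at most max_events queued events to the listener."""
        for _ in range(min(max_events, len(self.queue))):
            listener.on_event(self.queue.popleft())


class StageTracker:
    """Toy listener (hypothetical): keeps per-stage state until the stage ends."""

    def __init__(self):
        self.live = {}

    def on_event(self, event):
        kind, stage_id = event
        if kind == "start":
            self.live[stage_id] = {"metrics": []}
        elif kind == "end":
            self.live.pop(stage_id, None)


def simulate(num_stages, capacity, events_handled_per_tick):
    """Return (dropped events, stages still tracked after the app finishes)."""
    bus = BoundedBus(capacity)
    tracker = StageTracker()
    for stage_id in range(num_stages):
        bus.post(("start", stage_id))
        bus.post(("end", stage_id))
        # A slow handler gets through fewer events per tick.
        bus.drain(tracker, events_handled_per_tick)
    # Let the listener catch up on whatever is still queued.
    bus.drain(tracker, len(bus.queue))
    return bus.dropped, len(tracker.live)
```

With a handler that keeps up, e.g. {{simulate(100, 4, 2)}}, nothing is dropped 
and no stage stays tracked. With a handler that only gets through one event 
per tick, e.g. {{simulate(100, 4, 1)}}, the queue fills up, "end" events are 
dropped, and the listener is left holding state for stages that already 
finished.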

Here, though, I'd like to focus on {{SQLAppStatusListener.aggregateMetrics}} 
and making it faster. We should look at the other issues (unblocking event 
processing, cleaning up of stale data in listeners) separately.

(I also remember someone in the past trying to fix something in this area, but 
I couldn't find a PR or an open bug.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
