[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gengliang Wang updated SPARK-41053:
-----------------------------------
    Description: 

After SPARK-18085, the Spark history server (SHS) became more scalable for processing large applications by supporting a persistent KV store (LevelDB/RocksDB) as the storage layer.

For the live Spark UI, however, all the data is still stored in memory, which can put memory pressure on the Spark driver for large applications. For better stability of the Spark driver, I propose to:
 * {*}Support storing all the UI data in a persistent KV store{*}. RocksDB/LevelDB provide low memory overhead, and their write/read performance is fast enough to serve the write/read workload of the live UI. The SHS can also load the persistent KV store directly to speed up its startup.
 * *Support a new Protobuf serializer for all the UI data.* The new serializer is faster according to the benchmark below. It will be the default serializer for the persistent KV store of the live UI; for event logs it is optional. The current serializer for UI data is JSON, and its output is GZip-compressed before being written to the persistent KV store. Since RocksDB/LevelDB already compress data internally, the new serializer won't compress its output before writing to the persistent KV store (a minimal sketch of the idea follows below).

Here is a benchmark of writing/reading 100,000 SQLExecutionUIData objects to/from RocksDB:
|*Serializer*|*Avg write time (μs)*|*Avg read time (μs)*|*RocksDB files total size (MB)*|*Result total size in memory (MB)*|
|*Spark's KV serializer (JSON + GZip)*|352.2|119.26|837|868|
|*Protobuf*|109.9|34.3|858|2105|

I am also proposing to support only RocksDB, rather than both LevelDB and RocksDB, in the live UI.

SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing

was:

The current architecture of the Spark live UI and the Spark history server (SHS) is too simple to serve large clusters and heavy workloads:
 * Spark stores all the live UI data in memory. The size can reach a few GBs and affects the driver's stability (OOM).
 * There is a limit of storing only 1,000 queries, and we can't simply increase it under the current architecture. Memory profiling showed that storing one query execution detail takes about 800 KB, while storing one task takes about 0.3 KB. So for 1,000 SQL queries with 1,000 × 2,000 tasks, the memory usage for query execution and task data alone is about 1.4 GB (1,000 × 800 KB + 2,000,000 × 0.3 KB). The Spark UI stores UI data for jobs/stages/executors as well, so storing 10,000 queries may take more than 14 GB.
 * The SHS has to parse JSON-format event logs on its initial start. Uncompressed event logs can be as large as a few GBs, and parsing can be quite slow; some users reported waiting more than half an hour.

The proposal is to:
 # Store all the live UI data in a local RocksDB instance with Protobuf serialization.
 # Let the SHS use the RocksDB files of the live UI directly.
 # If the RocksDB files are unavailable to the SHS, write event logs with Protobuf for faster replay.
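To make the serializer proposal concrete, here is a rough sketch of the intended write/read path, assuming a hypothetical ui_data.proto compiled to a Java class SQLExecutionUIData; the real message layout, key scheme, and KVStore integration may differ:

{code:scala}
// Sketch only: store a Protobuf-encoded UI record in RocksDB without an
// extra GZip pass. Assumes a hypothetical ui_data.proto such as:
//   message SQLExecutionUIData { int64 execution_id = 1; string description = 2; }
// compiled with protoc, plus the rocksdbjni library on the classpath.
import org.rocksdb.{Options, RocksDB}

object ProtobufUiStoreSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val options = new Options().setCreateIfMissing(true)
    val db = RocksDB.open(options, "/tmp/live-ui-store")
    try {
      // Build one SQL execution record as a Protobuf message.
      val data = SQLExecutionUIData.newBuilder()
        .setExecutionId(1L)
        .setDescription("SELECT count(*) FROM events")
        .build()

      // Key layout is an illustration: a type prefix plus the numeric id.
      val key = s"sql-exec:${data.getExecutionId}".getBytes("UTF-8")

      // Write the raw Protobuf bytes. No GZip step here: RocksDB applies
      // its own block compression when flushing to disk. (The current JSON
      // path instead runs Jackson output through a GZIPOutputStream.)
      db.put(key, data.toByteArray)

      // Read path: parse the stored bytes back into the message.
      val restored = SQLExecutionUIData.parseFrom(db.get(key))
      assert(restored.getDescription == data.getDescription)
    } finally {
      db.close()
      options.close()
    }
  }
}
{code}

Skipping the GZip pass relies on RocksDB's built-in block compression, which matches the benchmark above: the on-disk sizes stay close (837 MB vs. 858 MB) while the average write time drops by roughly 3x.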
> Better Spark UI scalability and Driver stability for large applications
> -------------------------------------------------------------------------
>
>                 Key: SPARK-41053
>                 URL: https://issues.apache.org/jira/browse/SPARK-41053
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 3.4.0
>            Reporter: Gengliang Wang
>            Priority: Major