[
https://issues.apache.org/jira/browse/SPARK-25634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658343#comment-16658343
]
Felix Cheung commented on SPARK-25634:
--------------------------------------
how about off-heap and netty buffer usage?
> New Metrics in External Shuffle Service to help identify abusing application
> ----------------------------------------------------------------------------
>
> Key: SPARK-25634
> URL: https://issues.apache.org/jira/browse/SPARK-25634
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.0
> Reporter: Ye Zhou
> Priority: Minor
>
> We run Spark on YARN, and deploy Spark external shuffle service as part of
> YARN NM aux service. External Shuffle Service is shared by all Spark
> applications. SPARK-24355 enables the threads reservation to handle
> non-ChunkFetchRequest. SPARK-21501 limits the memory usage for Guava Cache to
> avoid OOM in shuffle service which could crash NodeManager. But still some
> application may generate a large amount of shuffle blocks which could heavily
> decrease the performance on some shuffle servers. When this abusing behavior
> happens, it might further decreases the overall performance for other
> applications if they happen to use the same shuffle servers. We have been
> seeing issues like this in our cluster, but there is no way for us to figure
> out which application is abusing shuffle service.
> SPARK-18364 has enabled expose out shuffle service metrics to Hadoop Metrics
> System. It is better if we can have the following metrics and also metrics
> divided by applicationID:
> 1. *shuffle server on-heap memory consumption for caching shuffle indexes*
> 2. *breakdown of shuffle indexes caching memory consumption by local
> executors*
> We can generate metrics when
> ExternalShuffleBlockHandler-->getSortBasedShuffleBlockData, which will
> trigger the Cache load. We can roughly be able to get the metrics from the
> shuffleindexfile size when putting into the cache and moved out from the
> cache.
> 3. *shuffle server load for shuffle block fetch requests*
> 4. *breakdown of shuffle server block fetch requests load by remote executors*
> We can generate metrics in ExternalShuffleBlockHandler-->handleMessage when a
> new OpenBlocks message is received.
> Open discussion for more metrics that could potentially influence the overall
> shuffle service performance.
> We can print out those metrics which are divided by applicationIDs in log,
> since it is hard to define fixed key and use numerical value for this kind of
> metrics.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]