[
https://issues.apache.org/jira/browse/SPARK-21493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-21493:
---------------------------------
Labels: bulk-closed (was: )
> Add more metrics to External Shuffle Service
> --------------------------------------------
>
> Key: SPARK-21493
> URL: https://issues.apache.org/jira/browse/SPARK-21493
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Raajay Viswanathan
> Priority: Minor
> Labels: bulk-closed
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> The current set of metrics in the external shuffle service is fairly
> limited. To debug failures of the shuffle service, it would be good to get
> more information about the state of the shuffle service. As a first cut,
> the following metrics seem important:
> 1. The amount of heap memory used by the External Shuffle Service process
> 2. The amount of direct buffer (off-heap) memory allocated to the External
> Shuffle Service. In the external shuffle service, Netty uses off-heap memory.
> Monitoring its usage can help in allocating appropriate resources and can
> also be used to raise alarms when the allocated memory exceeds a threshold.
> 3. The queue length in Netty event loops. Chunk Fetch requests or Open
> Block requests can be dropped as a result of Netty queue overflows (resulting
> in FetchFailure). Having hard data on queue size can help in attributing
> the cause of failures.
> Please let me know of other metrics (from Shuffle Service perspective) that
> would be good to have.
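
As a rough illustration of metrics 1 and 2 above, the sketch below reads heap
usage and direct-buffer usage from the standard JVM management beans. It is
not taken from the Spark code base: the class name ShuffleServiceMemoryProbe
and the metric key names are hypothetical, and it only sees direct memory that
the JVM itself accounts for in its "direct" buffer pool. Depending on how Netty
is configured, some of its off-heap allocations may not show up there, so treat
the numbers as a lower bound rather than an exact figure.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Spark.
public class ShuffleServiceMemoryProbe {

    /** Returns a snapshot of heap and direct-buffer usage of this JVM, in bytes. */
    public static Map<String, Long> snapshot() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        // Metric 1: heap memory used by the current process.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heapUsedBytes", heap.getUsed());
        metrics.put("heapCommittedBytes", heap.getCommitted());

        // Metric 2: direct (off-heap) buffers tracked by the JVM's "direct" pool.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                metrics.put("directUsedBytes", pool.getMemoryUsed());
                metrics.put("directBufferCount", pool.getCount());
            }
        }
        return metrics;
    }

    public static void main(String[] args) {
        // Allocate 1 MiB off-heap so the direct pool has something to report.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
        // Reference the buffer afterwards so it stays reachable during the snapshot.
        System.out.println("direct buffer capacity = " + buf.capacity());
    }
}
```

A real implementation would presumably register these values as gauges in the
shuffle service's existing metrics system and poll them periodically. Metric 3
(event loop queue length) cannot be read from the JDK alone; it would have to
come from Netty itself, for example via the pending-task count that Netty's
single-threaded event executors expose.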