[ 
https://issues.apache.org/jira/browse/SPARK-21493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21493:
---------------------------------
    Labels: bulk-closed  (was: )

> Add more metrics to External Shuffle Service
> --------------------------------------------
>
>                 Key: SPARK-21493
>                 URL: https://issues.apache.org/jira/browse/SPARK-21493
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Raajay Viswanathan
>            Priority: Minor
>              Labels: bulk-closed
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> The current set of metrics in the external shuffle service is fairly 
> limited. To debug failures of the shuffle service, it would be good to get 
> more information about its state. As a first cut, 
> the following metrics seem important:
> 1. The amount of heap memory used by the External Shuffle Service process
> 2. The amount of direct buffer (off-heap) memory allocated to External 
> Shuffle Service. In the external shuffle service, Netty uses off-heap memory. 
> Monitoring its usage can help in allocating appropriate resources and can 
> also be used to raise alarms when the allocated memory exceeds a threshold.
> 3. The queue length in Netty event loops. Chunk Fetch Requests or Open 
> Block requests can be dropped as a result of Netty queue overflows (resulting 
> in FetchFailure). Having hard data on queue size can help in attributing 
> the cause of failures.
> Please let me know of other metrics (from the Shuffle Service perspective) that 
> would be good to have.
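
For illustration only, metrics 1 and 2 above could be sampled with the standard JDK management beans, roughly as sketched below. The class name ShuffleServiceMemoryProbe and its method names are hypothetical and are not part of Spark. This is a minimal sketch assuming the probe runs inside the shuffle service JVM. One caveat: the JDK "direct" buffer pool only counts memory reserved through java.nio, so depending on how Netty is configured, some of its off-heap allocations may not show up here; Netty's own allocator metrics would be the more reliable source in that case.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper, not part of Spark: samples JVM memory figures that
// correspond to metrics 1 (heap) and 2 (direct buffers) in the description.
final class ShuffleServiceMemoryProbe {

    private ShuffleServiceMemoryProbe() {}

    /** Heap bytes currently used by this JVM. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /**
     * Bytes reserved in the JDK "direct" buffer pool, or -1 if the JVM does
     * not expose such a pool. Only allocations accounted by java.nio appear.
     */
    static long directUsedBytes() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    /** True when direct memory usage is above the given alarm threshold. */
    static boolean directAboveThreshold(long thresholdBytes) {
        return directUsedBytes() > thresholdBytes;
    }

    public static void main(String[] args) {
        // Allocate 1 MiB off-heap so the direct pool has something to report.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        System.out.println("heap used (bytes):   " + heapUsedBytes());
        System.out.println("direct used (bytes): " + directUsedBytes());
        System.out.println("buffer capacity:     " + buffer.capacity());
    }
}
```

A metrics registry could poll these methods as gauges, and directAboveThreshold could drive the alarm mentioned in item 2. Metric 3 (event loop queue length) is not covered by this sketch, since it needs access to the Netty event loops themselves.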



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

