[
https://issues.apache.org/jira/browse/SPARK-38309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-38309:
------------------------------------
Assignee: (was: Apache Spark)
> SHS has incorrect percentiles for shuffle read bytes and shuffle total blocks
> metrics
> -------------------------------------------------------------------------------------
>
> Key: SPARK-38309
> URL: https://issues.apache.org/jira/browse/SPARK-38309
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.0
> Reporter: Rob Reeves
> Priority: Major
> Labels: correctness
> Attachments: image-2022-02-23-14-19-33-255.png
>
>
> *Background*
> In [this PR|https://github.com/apache/spark/pull/26508] (SPARK-26260) the SHS
> stage metric percentiles were updated to only include successful tasks when
> using disk storage. It did this by making the values for each metric negative
> when the task is not in a successful state. This approach was chosen to avoid
> breaking changes to disk storage. See [this
> comment|https://github.com/apache/spark/pull/26508#issuecomment-554540314]
> for context.
> To get the percentiles, it reads the metric values, starting at 0, in
> ascending order. This filters out all tasks that are not successful because
> their values are less than 0. To get the percentile values it scales the
> percentiles to the list indexes of successful tasks. For example, if there
> are 200 successful tasks and you want percentiles [0, 25, 50, 75, 100], the
> lookup indexes in the task collection are [0, 50, 100, 150, 199].
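The index scaling described above can be sketched as follows (an illustrative sketch of the behavior, not Spark's actual code; `indexes_for` is a hypothetical name):

```python
def indexes_for(quantiles, successful_count):
    """Scale quantiles in [0, 1] to lookup indexes into the ascending
    list of successful-task metric values (illustrative sketch)."""
    return [min(int(q * successful_count), successful_count - 1)
            for q in quantiles]

# 200 successful tasks, percentiles [0, 25, 50, 75, 100]
print(indexes_for([0.0, 0.25, 0.5, 0.75, 1.0], 200))
# prints: [0, 50, 100, 150, 199]
```

Note the clamp to `successful_count - 1`: the 100th percentile would otherwise index one past the end of the list.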
> *Issue*
> For two metrics, 1) shuffle total reads and 2) shuffle total blocks, the
> above PR incorrectly leaves the metric index values positive for tasks that
> are not successful. This means tasks that are not successful are included in
> the percentile calculations. The percentile lookup index calculation is
> still based on the number of successful tasks, so the wrong task metric is
> returned for a given percentile. This was not caught because the unit test
> only verified values for one metric, executorRunTime.
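The effect can be illustrated with a small sketch (hypothetical helper names; assumes the negative-marker scheme from SPARK-26260, where a non-successful task's value is stored as `-value - 1`):

```python
def stored(value, succeeded):
    # SPARK-26260 marker scheme: non-successful tasks store a negative value
    # so an ascending scan starting at 0 skips them (illustrative sketch).
    return value if succeeded else -value - 1

def percentile_values(stored_values, successful_count, quantiles):
    # Filter out the negative markers, sort ascending, then look up each
    # quantile at an index scaled to the successful-task count.
    ascending = sorted(v for v in stored_values if v >= 0)
    return [ascending[min(int(q * successful_count), successful_count - 1)]
            for q in quantiles]

tasks = [(100, True), (200, True), (300, True), (50, False)]  # (bytes, ok)
successes = sum(1 for _, ok in tasks if ok)  # 3

correct = [stored(v, ok) for v, ok in tasks]  # failed task stored as -51
buggy = [v for v, _ in tasks]                 # the bug: always positive

print(percentile_values(correct, successes, [0.0, 0.5, 1.0]))  # [100, 200, 300]
print(percentile_values(buggy, successes, [0.0, 0.5, 1.0]))    # [50, 100, 200]
```

With the buggy storage the failed task's value survives the non-negative filter, but the lookup indexes still assume only 3 values are present, so every quantile is shifted: the reported max is 200 even though the largest successful task read 300 bytes, mirroring the mismatch seen in the repro steps below.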
> *Steps to Reproduce*
> _SHS UI_
> # Find a spark application in the SHS that has failed tasks for a stage with
> shuffle read.
> # Navigate to the stage UI.
> # Look at the max shuffle read size in the summary metrics.
> # Sort the tasks by shuffle read size descending. You'll see it doesn't
> match the max from step 3.
>
> !image-2022-02-23-14-19-33-255.png|width=789,height=389!
> _API_
> # For the same stage in the above repro steps, make a request to the task
> summary endpoint (e.g.
> /api/v1/applications/application_1632281309592_21294517/1/stages/6/0/taskSummary?quantiles=0,0.25,0.5,0.75,1.0)
> # Look at the shuffleReadMetrics.readBytes and
> shuffleReadMetrics.totalBlocksFetched. You will see -2 for at least some of
> the lower percentiles and the positive values will also be wrong.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)