[ https://issues.apache.org/jira/browse/SPARK-41541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-41541.
----------------------------------
    Fix Version/s: 3.3.2
                   3.1.4
                   3.2.3
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 39086
[https://github.com/apache/spark/pull/39086]

> Fix wrong child call in SQLShuffleWriteMetricsReporter.decRecordsWritten()
> --------------------------------------------------------------------------
>
>                 Key: SPARK-41541
>                 URL: https://issues.apache.org/jira/browse/SPARK-41541
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Major
>             Fix For: 3.3.2, 3.1.4, 3.2.3, 3.4.0
>
>
> In the {{SQLShuffleWriteMetricsReporter.decRecordsWritten}} method, a call to
> a child accidentally decrements _bytesWritten_ instead of
> _recordsWritten_:
> {code:java}
> override def decRecordsWritten(v: Long): Unit = {
>   // Bug: this forwards to the child's decBytesWritten, not decRecordsWritten.
>   metricsReporter.decBytesWritten(v)
>   _recordsWritten.set(_recordsWritten.value - v)
> }
> {code}
> One of the situations where {{decRecordsWritten}} is called is while reverting
> shuffle writes from failed/canceled tasks. Due to the mixup in these calls,
> the _recordsWritten_ metric ends up being _v_ records too high (since it
> wasn't decremented) and the _bytesWritten_ metric ends up _v_ too low (a
> record count was subtracted from a byte count), causing some failed tasks'
> write metrics to look like
> {code:java}
> {"Shuffle Bytes Written":-2109,"Shuffle Write Time":2923270,"Shuffle Records
> Written":2109}
> {code}
> instead of
> {code:java}
> {"Shuffle Bytes Written":0,"Shuffle Write Time":2923270,"Shuffle Records
> Written":0}
> {code}
> I'll submit a fix for this.
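
The mixup described in the issue can be reproduced with a small, self-contained sketch. Everything below is a hypothetical stand-in written for illustration: `WriteMetrics`, `ChildMetrics`, `WrappingReporter`, the `buggy` flag, and the 4096-byte figure are assumptions, not Spark's actual classes or data. The non-buggy branch shows the presumed shape of the fix (forwarding to the child's `decRecordsWritten`); the exact change is in the pull request linked above.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
trait WriteMetrics {
  def incBytesWritten(v: Long): Unit
  def incRecordsWritten(v: Long): Unit
  def decBytesWritten(v: Long): Unit
  def decRecordsWritten(v: Long): Unit
}

// Hypothetical child reporter that simply keeps two counters.
final class ChildMetrics extends WriteMetrics {
  var bytesWritten: Long = 0L
  var recordsWritten: Long = 0L
  override def incBytesWritten(v: Long): Unit = bytesWritten += v
  override def incRecordsWritten(v: Long): Unit = recordsWritten += v
  override def decBytesWritten(v: Long): Unit = bytesWritten -= v
  override def decRecordsWritten(v: Long): Unit = recordsWritten -= v
}

// Hypothetical wrapper that updates its own counters and forwards to a child.
// The `buggy` flag switches between the reported mixup and the presumed fix.
final class WrappingReporter(child: WriteMetrics, buggy: Boolean) extends WriteMetrics {
  var bytesWritten: Long = 0L
  var recordsWritten: Long = 0L
  override def incBytesWritten(v: Long): Unit = { child.incBytesWritten(v); bytesWritten += v }
  override def incRecordsWritten(v: Long): Unit = { child.incRecordsWritten(v); recordsWritten += v }
  override def decBytesWritten(v: Long): Unit = { child.decBytesWritten(v); bytesWritten -= v }
  override def decRecordsWritten(v: Long): Unit = {
    if (buggy) child.decBytesWritten(v) // the wrong child call from the report
    else child.decRecordsWritten(v)     // forward to the matching child method
    recordsWritten -= v
  }
}

object RevertDemo {
  // Writes 4096 bytes / 2109 records (made-up figures), then reverts the write
  // as a failed task would. Returns the child's (bytesWritten, recordsWritten).
  def revertedWrite(buggy: Boolean): (Long, Long) = {
    val child = new ChildMetrics
    val reporter = new WrappingReporter(child, buggy)
    reporter.incBytesWritten(4096L)
    reporter.incRecordsWritten(2109L)
    reporter.decBytesWritten(4096L)
    reporter.decRecordsWritten(2109L)
    (child.bytesWritten, child.recordsWritten)
  }
}
```

With `buggy = true` the child ends at `(-2109, 2109)`, the same pattern as the metrics quoted in the issue; with `buggy = false` both counters return to `(0, 0)`.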