cloud-fan commented on code in PR #55711:
URL: https://github.com/apache/spark/pull/55711#discussion_r3273233880
##########
sql/core/src/test/scala/org/apache/spark/sql/connector/MergeIntoTableSuiteBase.scala:
##########
@@ -2663,6 +2664,57 @@ abstract class MergeIntoTableSuiteBase extends
RowLevelOperationSuiteBase
}
}
+ test("metric values are stable across stage retries") {
+ // The join in the MERGE plan introduces a shuffle (with broadcast
disabled). The
+ // DAGScheduler corrupts the first attempt of every upstream shuffle map
stage, forcing
+ // the MergeRowsExec stage to retry. With plain SQLMetrics the row
counters would double
+ // up across attempts, but SQLLastAttemptMetric reports only the last
attempt, so the
+ // values surfaced via `MergeSummary` remain correct.
Review Comment:
Per your own TODO in the conversation thread, the current fetch-failure
injection doesn't actually retry the `MergeRowsExec`/writer stage — only
upstream shuffle map stages retry. With plain `SQLMetric`, the `numTargetRows*`
counters wouldn't double up under this injection, so this test passes equally
well without the SLAM swap. Suggested rewording:
```suggestion
// The join in the MERGE plan introduces a shuffle (with broadcast
disabled), and the
// DAGScheduler corrupts the first attempt of every upstream shuffle map
stage. Note:
// the current fetch-failure injection does not retry the
MergeRowsExec/writer stage,
// so this test passes equally well with plain SQLMetric — it only
exercises the
// SLAM-aware read path. Follow-up #55738 will add infra to actually
retry the writer
// stage and exercise the SLAM behavior end-to-end for MERGE.
```
##########
sql/core/src/test/scala/org/apache/spark/sql/connector/UpdateTableSuiteBase.scala:
##########
@@ -340,6 +341,46 @@ abstract class UpdateTableSuiteBase extends
RowLevelOperationSuiteBase {
checkUpdateMetrics(numUpdatedRows = 2, numCopiedRows = 1)
}
+ test("metric values are stable across stage retries") {
+ // Force a shuffle in the UPDATE plan via an IN-subquery (with broadcast
disabled), then
+ // have the DAGScheduler corrupt the first attempt of every upstream
shuffle map stage so
+ // the writer stage has to retry. With a plain SQLMetric the row counters
would double up
+ // across attempts, but SQLLastAttemptMetric reports only the last
attempt, so the values
+ // surfaced via `UpdateSummary` remain correct.
Review Comment:
Same point as on the MERGE test: per the TODO in the conversation thread,
the writer stage doesn't actually retry under the current fetch-failure
injection, so `numUpdatedRows` / `numCopiedRows` wouldn't double up with plain
`SQLMetric` either. Suggested rewording:
```suggestion
// Force a shuffle in the UPDATE plan via an IN-subquery (with broadcast
disabled), then
// have the DAGScheduler corrupt the first attempt of every upstream
shuffle map stage.
// Note: the current fetch-failure injection does not retry the writer
stage, so this
// test passes equally well with plain SQLMetric — it only exercises the
SLAM-aware
// read path. Follow-up #55738 will add infra to actually retry the
writer stage and
// exercise the SLAM behavior end-to-end for UPDATE.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]