juliuszsompolski commented on code in PR #55711:
URL: https://github.com/apache/spark/pull/55711#discussion_r3197833230
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/BatchScanExec.scala:
##########
@@ -44,6 +47,20 @@ case class BatchScanExec(
@transient lazy val batch: Batch = if (scan == null) null else scan.toBatch
+ override protected lazy val sparkMetrics: Map[String, SQLMetric] = {
+ val name = "number of output rows"
+ val metric = table match {
+ // Use SLAM for the scan-output count when this scan reads on behalf of a row-level DELETE,
+ // so that the driver-side derivation `numDeletedRows = numScannedRows - numCopiedRows` in
+ // `ReplaceDataExec.getWriteSummary` stays correct under stage retries.
+ case rlot: RowLevelOperationTable if rlot.operation.command() == DELETE =>
+ SQLLastAttemptMetrics.createMetric(sparkContext, name)
+ case _ =>
+ SQLMetrics.createMetric(sparkContext, name)
Review Comment:
There is some memory overhead: by default it keeps an array of partial values
with one entry per task, so let's say a couple hundred bytes to a couple of
kilobytes. Computational overhead should be negligible.
I think that, in general, numOutputRows metrics all over Spark could benefit
from porting to SLAM, but I wanted to keep the blast radius as small as
possible here for now.
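
To illustrate the trade-off, here is a toy sketch, not Spark's actual
SQLLastAttemptMetrics code. It assumes that SLAM stands for
SQLLastAttemptMetrics, that a last-attempt metric keeps one 8-byte Long slot
per task, and that a later report for a task simply overwrites the earlier
one. All names in it (SlamSketch, PlainSum, LastAttemptSum, scannedCounts)
are hypothetical.

```scala
// Toy model only: this is NOT Spark's SQLLastAttemptMetrics implementation.
// Every name below is hypothetical and exists only for illustration.
object SlamSketch {

  // Plain summing metric: every report is added, including retried attempts.
  final class PlainSum {
    private var total = 0L
    def report(taskIndex: Int, rows: Long): Unit = total += rows
    def value: Long = total
  }

  // Last-attempt metric: one slot per task; a newer report overwrites the old one.
  // Assumption: reports for a given task arrive in attempt order.
  final class LastAttemptSum(numTasks: Int) {
    private val partial = new Array[Long](numTasks)
    def report(taskIndex: Int, rows: Long): Unit = partial(taskIndex) = rows
    def value: Long = partial.sum
    // Size of the partial-values array alone, assuming 8 bytes per Long.
    def partialValuesBytes: Long = 8L * numTasks
  }

  // Replays the same (taskIndex, rows) reports into both metrics.
  // Returns (plain total, last-attempt total).
  def scannedCounts(numTasks: Int, reports: Seq[(Int, Long)]): (Long, Long) = {
    val plain = new PlainSum
    val last = new LastAttemptSum(numTasks)
    reports.foreach { case (task, rows) =>
      plain.report(task, rows)
      last.report(task, rows)
    }
    (plain.value, last.value)
  }

  def main(args: Array[String]): Unit = {
    // Four tasks scan 100 rows each; task 2 is retried and reports again.
    val reports = Seq(0 -> 100L, 1 -> 100L, 2 -> 100L, 3 -> 100L, 2 -> 100L)
    val (plain, last) = scannedCounts(4, reports)
    val numCopiedRows = 360L // made-up value for the example
    println(s"plain:        numDeletedRows = ${plain - numCopiedRows}")
    println(s"last-attempt: numDeletedRows = ${last - numCopiedRows}")
  }
}
```

With these made-up numbers, the plain sum counts 500 scanned rows and would
derive 140 deleted rows, while the last-attempt sum counts 400 and derives
40. Under the 8-byte assumption, 25 tasks cost about 200 bytes and 250 tasks
about 2 KB, which matches the range quoted above.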
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]