Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

via GitHub Wed, 17 Jun 2026 16:58:57 -0700


cloud-fan commented on code in PR #56550:
URL: https://github.com/apache/spark/pull/56550#discussion_r3432157908



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -383,6 +459,7 @@ object JdbcUtils extends Logging with SQLConfHelper {
             if (rs.wasNull) mutableRow.setNullAt(i)
             i = i + 1
           }
+          inputMetrics.incBytesRead(estimateInternalRowSize(mutableRow, 
schema))

Review Comment:
   **Design question (blocking): are `bytesRead`/`bytesWritten` the right 
metrics for this?** Their established, source-independent meaning is the 
*physical bytes moved to/from the external system* — `HadoopRDD`/`FileScanRDD` 
populate `bytesRead` from the FileSystem byte-statistics callback 
(`getBytesReadCallback`) or the on-disk split length 
(`split.inputSplit.value.getLength`), and the Hadoop write path sets 
`bytesWritten` the same way; the metric is documented as "reading/writing data 
from/to external systems." This change instead stores an estimate of *decoded 
logical row size*, which is neither the JDBC wire/protocol bytes nor any 
physical I/O figure. The effect is that the metric means different things 
depending on the source, so anything aggregating/comparing across sources 
(Spark UI Input/Output Size, cost/capacity/autoscaling tooling) silently mixes 
incompatible quantities — and a plausible non-zero value looks like a measured 
I/O figure when it isn't.
   
   Could you rethink the approach? Rather than overloading the task-level I/O 
metrics, consider adding new dedicated metrics on the SQL JDBC scan operator (a 
`SQLMetric` surfaced in the SQL UI, alongside the existing scan metrics like 
`queryExecutionTime` / `fetchAndTransformToInternalRowsNs`), clearly labeled as 
an estimated data size — plus a matching metric on the write side. That keeps 
the physical-I/O metrics' meaning intact while still exposing the size signal.
   
   _Separately, non-blocking:_ this adds a second per-row pass over all columns 
purely for the metric — and `estimateRowSize` does the same at the write loop 
(`:959`) — running unconditionally even when no consumer reads the metric. The 
per-column cost is O(1), so just a note; moot if the approach above is 
reconsidered.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

Reply via email to