cloud-fan commented on code in PR #56550:
URL: https://github.com/apache/spark/pull/56550#discussion_r3432157908
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -383,6 +459,7 @@ object JdbcUtils extends Logging with SQLConfHelper {
if (rs.wasNull) mutableRow.setNullAt(i)
i = i + 1
}
+ inputMetrics.incBytesRead(estimateInternalRowSize(mutableRow,
schema))
Review Comment:
**Design question (blocking): are `bytesRead`/`bytesWritten` the right
metrics for this?** Their established, source-independent meaning is the
*physical bytes moved to/from the external system* — `HadoopRDD`/`FileScanRDD`
populate `bytesRead` from the FileSystem byte-statistics callback
(`getBytesReadCallback`) or the on-disk split length
(`split.inputSplit.value.getLength`), and the Hadoop write path sets
`bytesWritten` the same way; the metric is documented as "reading/writing data
from/to external systems." This change instead stores an estimate of *decoded
logical row size*, which is neither the JDBC wire/protocol bytes nor any
physical I/O figure. The effect is that the metric means different things
depending on the source, so anything aggregating/comparing across sources
(Spark UI Input/Output Size, cost/capacity/autoscaling tooling) silently mixes
incompatible quantities — and a plausible non-zero value looks like a measured
I/O figure when it isn't.
Could you rethink the approach? Rather than overloading the task-level I/O
metrics, consider adding new dedicated metrics on the SQL JDBC scan operator (a
`SQLMetric` surfaced in the SQL UI, alongside the existing scan metrics like
`queryExecutionTime` / `fetchAndTransformToInternalRowsNs`), clearly labeled as
an estimated data size — plus a matching metric on the write side. That keeps
the physical-I/O metrics' meaning intact while still exposing the size signal.
_Separately, non-blocking:_ this adds a second per-row pass over all columns
purely for the metric — and `estimateRowSize` does the same at the write loop
(`:959`) — running unconditionally even when no consumer reads the metric. The
per-column cost is O(1), so just a note; moot if the approach above is
reconsidered.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]