Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

via GitHub Wed, 01 Jul 2026 12:05:55 -0700


yadavay-amzn commented on code in PR #56550:
URL: https://github.com/apache/spark/pull/56550#discussion_r3508434000



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -878,6 +988,7 @@ object JdbcUtils extends Logging with SQLConfHelper {
           stmt.addBatch()
           rowCount += 1
           totalRowCount += 1
+          bytesAccumulator.foreach(_.add(measureRowSize(row, rddSchema)))

Review Comment:
   Good point - wrapped the write-side measureRowSize call in a NonFatal guard. 
It sits after stmt.addBatch(), so a swallowed measurement error can never 
affect the batch or the write. Added an end-to-end Binary write test (exercises 
the getAs[Array[Byte]] path that would ClassCastException if mistyped) plus 
variable-length safety coverage; Array measurement is unit-tested in 
JdbcUtilsSuite (H2 has no Array-write support for an e2e test). I kept the 
try/NonFatal wrap rather than making measureRowSize total-by-construction since 
it is zero-cost on the happy path and stays safe regardless of future type 
additions.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -378,11 +480,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
         if (rs.next()) {
           inputMetrics.incRecordsRead(1)
           var i = 0
+          var rowSize = 0L
+          val measureSize = dataSizeMetric.isDefined
           while (i < getters.length) {
             getters(i).apply(rs, mutableRow, i)
             if (rs.wasNull) mutableRow.setNullAt(i)
+            // Accumulate actual decoded size within the same per-column pass.
+            if (measureSize) rowSize += columnSizers(i)(mutableRow)

Review Comment:
   The per-cell cost is O(1) - a per-column sizer function built once per scan 
(not per row), doing a numBytes() field read for variable-length types and a 
constant for fixed-width - so it is dominated by the ResultSet.getX() + network 
cost of the same cell, and the single-pass design avoids the extra row walk the 
earlier version had. I did not gate it behind a conf to avoid another knob, but 
if you would prefer it gatable (or want a decode microbenchmark for 
wide/high-row-count tables), I am happy to add either - let me know which you 
prefer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

Reply via email to