Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

via GitHub Wed, 01 Jul 2026 04:13:01 -0700


MaxGekk commented on code in PR #56550:
URL: https://github.com/apache/spark/pull/56550#discussion_r3505367898



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -378,11 +480,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
         if (rs.next()) {
           inputMetrics.incRecordsRead(1)
           var i = 0
+          var rowSize = 0L
+          val measureSize = dataSizeMetric.isDefined
           while (i < getters.length) {
             getters(i).apply(rs, mutableRow, i)
             if (rs.wasNull) mutableRow.setNullAt(i)
+            // Accumulate actual decoded size within the same per-column pass.
+            if (measureSize) rowSize += columnSizers(i)(mutableRow)

Review Comment:
   Question (non-blocking, perf): `measureSize` is `dataSizeMetric.isDefined`, 
and `JDBCRDD` always passes `Some(...)`, so this per-column sizer (indirect 
call + null check + `numBytes()`) runs on **every cell of every JDBC scan**, 
whether or not the metric is consumed. The single-pass design keeps it cheap, 
but it's an unconditional add to the decode hot path that all JDBC reads now 
pay.
   
   Is there a read-decode microbenchmark showing the overhead is negligible 
(esp. for wide tables / high row counts), or would it be worth making the 
measurement gatable?



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -878,6 +988,7 @@ object JdbcUtils extends Logging with SQLConfHelper {
           stmt.addBatch()
           rowCount += 1
           totalRowCount += 1
+          bytesAccumulator.foreach(_.add(measureRowSize(row, rddSchema)))

Review Comment:
   Question (non-blocking): this measurement call sits inside the write `try`, 
in the per-row loop, **before `conn.commit()`**. So although it's only a 
metrics helper, if `measureRowSize` throws it aborts the write and rolls back 
(transactional) / leaves dirty rows (non-transactional) — turning a metrics 
feature into a write-breaking regression.
   
   The throw surface is its type-specific extraction: `row.getString(i)`, 
`row.getAs[Array[Byte]](i)`, `row.getSeq[Any](i)`, and 
`e.asInstanceOf[String]`. For strictly schema-conformant external `Row`s these 
are safe, but a `ClassCastException` on a Binary/Array/boxed-element edge would 
fail a previously-successful write.
   
   Two things: (1) is `measureRowSize` guaranteed *total* over every 
JDBC-writable type (or worth wrapping so a measurement error can never fail the 
write)? (2) do the write tests exercise **Binary and Array writes end-to-end**? 
The current assertions look like `value > 0` on simple types, which wouldn't 
catch a cast failure on those paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57471][SQL] Populate input/output size metrics for JDBC reads and writes [spark]

Reply via email to