yadavay-amzn opened a new pull request, #56550:
URL: https://github.com/apache/spark/pull/56550

   ### What changes were proposed in this pull request?
   
   Populate the input/output **size (bytes)** task metrics for the JDBC 
datasource (both reads and writes). Previously only the record counts were 
populated and the byte-size metrics always reported 0, because JDBC does no 
Hadoop filesystem I/O (so the default `bytesRead` callback returns 0) and the 
write path never set `bytesWritten`.
   
   Since there is no filesystem to measure, the size is estimated on the Spark 
side from each row's schema and values in `JdbcUtils`:
   - Read path: after `inputMetrics.incRecordsRead(1)`, call 
`inputMetrics.incBytesRead(...)` with the estimated row size (the `InternalRow` 
and schema are already in scope).
   - Write path (`savePartition`): accumulate the estimated size per row and 
call `outMetrics.setBytesWritten(...)` alongside the existing 
`setRecordsWritten` on both completion paths.
   
   The estimate uses the actual byte length for variable-length values 
(`UTF8String.numBytes()` for strings on the read path; `Array[Byte].length` for 
binary) and `dataType.defaultSize` for fixed-width types; null fields 
contribute 0. On the write path the string size uses the character length to 
avoid per-row allocation in the hot write loop (exact for ASCII; a reasonable 
estimate otherwise). Both the v1 and v2 JDBC paths converge in `JdbcUtils`, so 
a single change covers both.
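
The estimation rule above can be sketched in plain Scala, so it runs without Spark on the classpath. This is an illustration under stated assumptions, not the code in this PR: `Field`, `RowSizeEstimate`, and the `exactStringBytes` flag are hypothetical names, and the real change works on `InternalRow` and `UTF8String` values inside `JdbcUtils` rather than on `Any`.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for a schema field; `defaultSize` plays the role of
// `dataType.defaultSize` and is only consulted for fixed-width values.
final case class Field(name: String, defaultSize: Int)

object RowSizeEstimate {
  /**
   * Estimated size of one row in bytes.
   *  - null fields contribute 0
   *  - strings contribute their UTF-8 byte length (read-path style) or their
   *    character length (write-path style, avoids encoding the string)
   *  - binary values contribute their array length
   *  - everything else contributes the field's default size
   */
  def estimate(schema: Seq[Field], values: Seq[Any], exactStringBytes: Boolean): Long = {
    require(schema.length == values.length, "schema and row must have the same arity")
    schema.zip(values).foldLeft(0L) { case (acc, (field, value)) =>
      val size: Long = value match {
        case null                          => 0L
        case s: String if exactStringBytes => s.getBytes(UTF_8).length.toLong
        case s: String                     => s.length.toLong
        case b: Array[Byte]                => b.length.toLong
        case _                             => field.defaultSize.toLong
      }
      acc + size
    }
  }
}

// Example row: a 4-byte integer, a string with one non-ASCII character,
// and a 3-byte binary value.
val schema = Seq(Field("id", 4), Field("name", 0), Field("payload", 0))
val row: Seq[Any] = Seq(1, "h\u00e9llo", Array[Byte](1, 2, 3))

val readStyle  = RowSizeEstimate.estimate(schema, row, exactStringBytes = true)
val writeStyle = RowSizeEstimate.estimate(schema, row, exactStringBytes = false)
```

For this row the two styles differ by one byte, because the non-ASCII character occupies two bytes in UTF-8 but counts as a single character. For pure ASCII strings the two estimates coincide.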
   
   ### Why are the changes needed?
   
   JDBC read/write tasks reported `bytesRead`/`bytesWritten` as 0, so users and 
tooling had no size signal for JDBC I/O (only row counts). A Spark-side 
estimate gives a useful, non-zero approximation.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. The input/output size metrics for JDBC reads and writes are now 
populated with an estimated byte size instead of always 0. These are estimates 
of the Spark-side row size, not exact wire bytes.
   
   ### How was this patch tested?
   
   New tests in `JDBCSuite` (read) and `JDBCWriteSuite` (write), using the 
existing `SparkListener.onTaskEnd` metrics pattern:
   - bytes are 0 on master and populated with the fix (TDD),
   - bytes scale with string width and with binary length (proving 
variable-length estimation, not a constant),
   - null values do not error and yield fewer bytes than the populated 
equivalent.
   
   ### Credit
   
   This issue was reported by **Craiu Constantin-Tiberiu** (@tibicraiu), who 
had started a fix, generously handed the ticket over, and attached a verified 
patch for reference. Thank you.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

