yadavay-amzn opened a new pull request, #56550: URL: https://github.com/apache/spark/pull/56550
### What changes were proposed in this pull request?

Populate the input/output **size (bytes)** task metrics for the JDBC datasource (both reads and writes). Previously only the record counts were populated and the byte-size metrics always reported 0, because JDBC does no Hadoop filesystem I/O (so the default `bytesRead` callback returns 0) and the write path never set `bytesWritten`.

Since there is no filesystem to measure, the size is estimated on the Spark side from each row's schema and values in `JdbcUtils`:

- Read path: after `inputMetrics.incRecordsRead(1)`, call `inputMetrics.incBytesRead(...)` with the estimated row size (the `InternalRow` and schema are already in scope).
- Write path (`savePartition`): accumulate the estimated size per row and call `outMetrics.setBytesWritten(...)` alongside the existing `setRecordsWritten` on both completion paths.

The estimate uses the actual byte length for variable-length values (`UTF8String.numBytes()` for strings on the read path; `Array[Byte].length` for binary) and `dataType.defaultSize` for fixed-width types; null fields contribute 0. On the write path the string size uses the character length to avoid per-row allocation in the hot write loop (exact for ASCII; a reasonable estimate otherwise).

Both the v1 and v2 JDBC paths converge in `JdbcUtils`, so a single change covers both.

### Why are the changes needed?

JDBC read/write tasks reported `bytesRead`/`bytesWritten` as 0, so users and tooling had no size signal for JDBC I/O (only row counts). A Spark-side estimate gives a useful, non-zero approximation.

### Does this PR introduce _any_ user-facing change?

Yes. The input/output size metrics for JDBC reads and writes are now populated with an estimated byte size instead of always 0. These are estimates of the Spark-side row size, not exact wire bytes.

### How was this patch tested?
New tests in `JDBCSuite` (read) and `JDBCWriteSuite` (write), using the existing `SparkListener.onTaskEnd` metrics pattern:

- bytes are 0 on master and populated with the fix (TDD),
- bytes scale with string width and with binary length (proving variable-length estimation, not a constant),
- null values do not error and yield fewer bytes than the populated equivalent.

### Credit

This issue was reported by **Craiu Constantin-Tiberiu** (@tibicraiu), who had started a fix, generously handed the ticket over, and attached a verified patch for reference. Thank you.

### Was this patch authored or co-authored using generative AI tooling?

Yes.
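### Illustrative sketch of the estimate

For readers who want the estimation rules in one place, here is a minimal standalone model of them. It is not the code in `JdbcUtils` and does not depend on Spark: `FieldType`, `FixedWidth`, `StringField`, `BinaryField`, `RowSizeEstimate`, and the `exactStringBytes` flag are hypothetical names introduced only for illustration. The flag stands in for the read/write difference described in this PR (UTF-8 byte length on the read path, character length on the write path).

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-ins for Spark's data types; not Spark classes.
sealed trait FieldType
final case class FixedWidth(defaultSize: Int) extends FieldType
case object StringField extends FieldType
case object BinaryField extends FieldType

object RowSizeEstimate {
  // Fixed-width examples; 4 and 8 bytes mirror the usual int/long widths.
  val IntField: FieldType = FixedWidth(4)
  val LongField: FieldType = FixedWidth(8)

  // Estimated size of a single field value.
  def fieldSize(fieldType: FieldType, value: Any, exactStringBytes: Boolean): Long =
    (fieldType, value) match {
      // Null fields contribute 0 regardless of type.
      case (_, null) => 0L
      case (StringField, s: String) =>
        // Read-style: UTF-8 byte length. In this sketch getBytes allocates;
        // the PR's read path uses UTF8String.numBytes() instead.
        // Write-style: character length, with no allocation.
        if (exactStringBytes) s.getBytes(StandardCharsets.UTF_8).length.toLong
        else s.length.toLong
      // Binary values use their actual length.
      case (BinaryField, b: Array[Byte]) => b.length.toLong
      // Fixed-width types use their default size.
      case (FixedWidth(size), _) => size.toLong
      case (other, v) =>
        throw new IllegalArgumentException(s"unexpected value $v for field type $other")
    }

  // Estimated size of a row: the sum of its field estimates.
  def estimateRowSize(
      schema: Seq[FieldType],
      values: Seq[Any],
      exactStringBytes: Boolean): Long = {
    require(schema.length == values.length, "schema and row must have the same arity")
    schema.zip(values).foldLeft(0L) { case (acc, (fieldType, value)) =>
      acc + fieldSize(fieldType, value, exactStringBytes)
    }
  }
}
```

Under these assumptions, a row `(1, "h\u00e9llo")` with schema `(IntField, StringField)` is estimated at 4 + 6 = 10 bytes read-style and 4 + 5 = 9 bytes write-style, since the accented character takes two bytes in UTF-8 but counts as one character. Replacing the string with `null` drops the estimate to 4 in both cases.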
