yadavay-amzn commented on code in PR #56550:
URL: https://github.com/apache/spark/pull/56550#discussion_r3508434000
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -878,6 +988,7 @@ object JdbcUtils extends Logging with SQLConfHelper {
stmt.addBatch()
rowCount += 1
totalRowCount += 1
+ bytesAccumulator.foreach(_.add(measureRowSize(row, rddSchema)))
Review Comment:
Good point - wrapped the write-side measureRowSize call in a NonFatal guard.
It sits after stmt.addBatch(), so a swallowed measurement error can never
affect the batch or the write. Added an end-to-end Binary write test (exercises
the getAs[Array[Byte]] path that would ClassCastException if mistyped) plus
variable-length safety coverage; Array measurement is unit-tested in
JdbcUtilsSuite (H2 has no Array-write support for an e2e test). I kept the
try/NonFatal wrap rather than making measureRowSize total-by-construction since
it is zero-cost on the happy path and stays safe regardless of future type
additions.
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala:
##########
@@ -378,11 +480,16 @@ object JdbcUtils extends Logging with SQLConfHelper {
if (rs.next()) {
inputMetrics.incRecordsRead(1)
var i = 0
+ var rowSize = 0L
+ val measureSize = dataSizeMetric.isDefined
while (i < getters.length) {
getters(i).apply(rs, mutableRow, i)
if (rs.wasNull) mutableRow.setNullAt(i)
+ // Accumulate actual decoded size within the same per-column pass.
+ if (measureSize) rowSize += columnSizers(i)(mutableRow)
Review Comment:
The per-cell cost is O(1) - a per-column sizer function built once per scan
(not per row), doing a numBytes() field read for variable-length types and a
constant for fixed-width - so it is dominated by the ResultSet.getX() + network
cost of the same cell, and the single-pass design avoids the extra row walk the
earlier version had. I did not gate it behind a conf to avoid another knob, but
if you would prefer it gatable (or want a decode microbenchmark for
wide/high-row-count tables), I am happy to add either - let me know which you
prefer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]