Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/22594#discussion_r222875551
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
@@ -104,12 +104,14 @@ class FileScanRDD(
       val nextElement = currentIterator.next()
       // TODO: we should have a better separation of row based and batch based scan, so that we
       // don't need to run this `if` for every record.
+      val preNumRecordsRead = inputMetrics.recordsRead
       if (nextElement.isInstanceOf[ColumnarBatch]) {
         inputMetrics.incRecordsRead(nextElement.asInstanceOf[ColumnarBatch].numRows())
       } else {
         inputMetrics.incRecordsRead(1)
       }
-      if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
--- End diff ---
The original goal here is to avoid updating the metrics for every record, because doing so is too expensive. I am not sure what your change is trying to achieve. Could you add a test case to SQLMetricsSuite?
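
For reference, here is a minimal sketch (plain Scala, not the actual FileScanRDD code) of why a `preNumRecordsRead` snapshot can matter for interval-based updates: when records are counted in batches (e.g. a whole `ColumnarBatch` at once), the counter can jump over an exact multiple of the interval, so a plain `recordsRead % interval == 0` check may never fire. The `interval` constant and `updateMetrics` helper below are hypothetical stand-ins.

```scala
// Minimal sketch of an interval-crossing check for batched record counts.
// `interval` and `updateMetrics` are hypothetical stand-ins, not Spark APIs.
object MetricsIntervalSketch {
  val interval = 1000L // stand-in for UPDATE_INPUT_METRICS_INTERVAL_RECORDS

  var recordsRead = 0L

  def incRecordsRead(n: Long): Unit = {
    val preNumRecordsRead = recordsRead
    recordsRead += n
    // A batch of rows can push the counter past a multiple of `interval`
    // without landing on it exactly, so `recordsRead % interval == 0` can
    // miss. Comparing the interval bucket before and after the increment
    // catches every crossing.
    if (preNumRecordsRead / interval != recordsRead / interval) {
      updateMetrics()
    }
  }

  def updateMetrics(): Unit =
    println(s"metrics refreshed at $recordsRead records")

  def main(args: Array[String]): Unit = {
    incRecordsRead(999)  // no update: still in bucket 0
    incRecordsRead(4096) // update fires: crossed 1000, 2000, ... in one jump
    incRecordsRead(1)    // no update: still in bucket 5
  }
}
```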
---