advancedxy commented on a change in pull request #22324: [SPARK-25237][SQL]
Remove updateBytesReadWithFileSize in FileScanRDD
URL: https://github.com/apache/spark/pull/22324#discussion_r361353029
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
##########
@@ -473,6 +476,27 @@ class FileBasedDataSourceSuite extends QueryTest with
SharedSQLContext with Befo
}
}
}
+
+ test("SPARK-25237 compute correct input metrics in FileScanRDD") {
+ withTempPath { p =>
+ val path = p.getAbsolutePath
+ spark.range(1000).repartition(1).write.csv(path)
+ val bytesReads = new mutable.ArrayBuffer[Long]()
+ val bytesReadListener = new SparkListener() {
+ override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
+ bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
+ }
+ }
+ sparkContext.addSparkListener(bytesReadListener)
+ try {
+ spark.read.csv(path).limit(1).collect()
+ sparkContext.listenerBus.waitUntilEmpty(1000L)
+ assert(bytesReads.sum === 7860)
Review comment:
> In this test, Spark run with local[2] and each scan thread points to the
same CSV file. Since each thread gets the file size thru Hadoop APIs, the total
byteRead becomes 2 * the file size, IIUC.
I'm afraid that's not the case: the csv source infers the schema first, which
loads the first row in the path before the actual read. That's why the input
bytes read are doubled. It may be more reasonable to just write and read a
`text` file.
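As a sketch of that suggestion, the test could be rewritten against the `text`
source, so the file is scanned exactly once and no schema-inference pass
touches it (this is a hypothetical rework, assuming the same
`FileBasedDataSourceSuite` fixtures such as `spark`, `sparkContext`, and
`withTempPath`; note `write.text` requires a single string column, hence the
cast):

```scala
withTempPath { p =>
  val path = p.getAbsolutePath
  // text source needs one string column, so cast the range's `id` first
  spark.range(1000).selectExpr("CAST(id AS STRING)").repartition(1).write.text(path)
  val bytesReads = new mutable.ArrayBuffer[Long]()
  val bytesReadListener = new SparkListener() {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
    }
  }
  sparkContext.addSparkListener(bytesReadListener)
  try {
    spark.read.text(path).limit(1).collect()
    sparkContext.listenerBus.waitUntilEmpty(1000L)
    // With a single scan and no inference pass, bytesRead should reflect
    // one read of the file rather than the doubled CSV figure.
  } finally {
    sparkContext.removeSparkListener(bytesReadListener)
  }
}
```

Removing the listener in `finally` also keeps it from leaking into other tests
in the suite.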
As for `3930 = 3890 + 40`, the extra 40 bytes are the size of the `.crc` file:
Hadoop uses `ChecksumFileSystem` internally.
And one more thing: this test case may be inaccurate. If the task completes
successfully, all the data is consumed, `updateBytesReadWithFileSize` is a
no-op, and `updateBytesRead()` in the close function will update the correct
size.
FYI @maropu
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]