advancedxy commented on a change in pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in FileScanRDD
URL: https://github.com/apache/spark/pull/22324#discussion_r361353029
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
 ##########
 @@ -473,6 +476,27 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext with Befo
       }
     }
   }
+
+  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
+    withTempPath { p =>
+      val path = p.getAbsolutePath
+      spark.range(1000).repartition(1).write.csv(path)
+      val bytesReads = new mutable.ArrayBuffer[Long]()
+      val bytesReadListener = new SparkListener() {
+        override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
+          bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
+        }
+      }
+      sparkContext.addSparkListener(bytesReadListener)
+      try {
+        spark.read.csv(path).limit(1).collect()
+        sparkContext.listenerBus.waitUntilEmpty(1000L)
+        assert(bytesReads.sum === 7860)
 
 Review comment:
   > In this test, Spark run with local[2] and each scan thread points to the 
same CSV file. Since each thread gets the file size thru Hadoop APIs, the total 
byteRead becomes 2 * the file size, IIUC.
   
   I am afraid that's not the case: CSV infers the schema first, which loads the first row in the path, and only then does the actual read happen. That's why the input bytes read is doubled. It may be more reasonable to just write and read a `text` file, as in the sketch below.
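
   A rough sketch of that `text`-based variant (untested; it mirrors the CSV test above and assumes the suite's existing imports of `scala.collection.mutable` and `org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}`):

```scala
test("SPARK-25237 compute correct input metrics in FileScanRDD (text)") {
  withTempPath { p =>
    val path = p.getAbsolutePath
    // The text source needs a single string column, so cast the ids first.
    spark.range(1000).selectExpr("cast(id as string)").repartition(1).write.text(path)

    val bytesReads = new mutable.ArrayBuffer[Long]()
    val bytesReadListener = new SparkListener() {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        bytesReads += taskEnd.taskMetrics.inputMetrics.bytesRead
      }
    }
    sparkContext.addSparkListener(bytesReadListener)
    try {
      // No schema inference pass for text, so the file is scanned only once.
      spark.read.text(path).limit(1).collect()
      sparkContext.listenerBus.waitUntilEmpty(1000L)
      // The exact expected value depends on the generated file (plus the
      // ~40-byte .crc sidecar), so no hard-coded size is asserted here.
      assert(bytesReads.sum > 0)
    } finally {
      sparkContext.removeSparkListener(bytesReadListener)
    }
  }
}
```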
   
   As for `3930 = 3890 + 40`, the extra 40 bytes are the size of the `.crc` sidecar file; Hadoop's local file system uses `ChecksumFileSystem` internally.
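
   For reference, a back-of-the-envelope check of those 40 bytes (a sketch; it assumes the default `io.bytes.per.checksum` of 512, one 4-byte CRC per chunk, and an 8-byte header, which is my reading of the `.crc` layout rather than something stated in this PR):

```scala
val dataLen = 3890L                                               // CSV part file size
val bytesPerChecksum = 512L                                       // assumed Hadoop default
val chunks = (dataLen + bytesPerChecksum - 1) / bytesPerChecksum  // = 8 chunks
val crcFileSize = 8 + chunks * 4                                  // 8-byte header + 32 = 40
```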
   
   And one more thing: this test case may be inaccurate. If the task completes successfully, all the data is consumed, so `updateBytesReadWithFileSize` is a no-op and `updateBytesRead()` in the close function already updates the correct size.
   FYI @maropu 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
