Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20437#discussion_r165240426

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           val metadata = Map(
             "files" -> newFiles.toList,
             StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
    -      val inputInfo = StreamInputInfo(id, 0, metadata)
    +      val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
    --- End diff --

    What's the difference between using the HDFS API to count in another thread compared to the current solution? You still cannot avoid reading the files twice.
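To see why the reviewer's concern applies, note that `rdd.count` is a Spark action: it launches a job that scans the input files, and any later action on the same (uncached) RDD scans them again. The following is a minimal, self-contained Scala sketch (not Spark itself; `LazyFileData` and the read counter are hypothetical stand-ins) illustrating the double-read effect:

```scala
// Simplified illustration of the review point: with a lazily evaluated,
// file-backed dataset, counting records up front and then processing them
// downstream scans the underlying data twice.
object DoubleReadSketch {
  var reads = 0 // tracks how many times the "files" are scanned

  // Stand-in for an uncached, file-backed RDD: every action re-reads the source.
  class LazyFileData(lines: => Seq[String]) {
    private def scan(): Seq[String] = { reads += 1; lines }
    def count(): Long = scan().size.toLong   // action #1: reads the files
    def collect(): Seq[String] = scan()      // action #2: reads them again
  }

  def main(args: Array[String]): Unit = {
    val rdd = new LazyFileData(Seq("a", "b", "c"))
    val recordCount = rdd.count() // eager count, as in the proposed patch
    val processed = rdd.collect() // downstream processing scans again
    println(s"records=$recordCount reads=$reads")
  }
}
```

Whether the count happens inline (as in the patch) or via the HDFS API on a separate thread, the input is still read once to count and once to process, which is the trade-off being questioned here.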