Github user guoxiaolongzte commented on a diff in the pull request:
https://github.com/apache/spark/pull/20437#discussion_r165251810
--- Diff:
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
---
@@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
val metadata = Map(
"files" -> newFiles.toList,
StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
- val inputInfo = StreamInputInfo(id, 0, metadata)
+ val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
--- End diff --
If you can add a switch parameter, the default value is false.
If it is true, then it needs to be count (read the file again) so that the
records can be correctly counted. Of course, it shows that when the parameter
is opened to true, the streaming performance problem will be affected.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]