Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20437#discussion_r165240426

    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
    @@ -157,7 +157,7 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
           val metadata = Map(
             "files" -> newFiles.toList,
             StreamInputInfo.METADATA_KEY_DESCRIPTION -> newFiles.mkString("\n"))
    -      val inputInfo = StreamInputInfo(id, 0, metadata)
    +      val inputInfo = StreamInputInfo(id, rdds.map(_.count).sum, metadata)
    --- End diff --

    What's the difference between using the HDFS API to count in another thread compared to the current solution? You still cannot avoid reading the files twice.
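To see why the reviewer's concern applies, note that `rdd.count` is a Spark action: it launches a job that scans the input files, and any later action on the same (uncached) RDD scans them again. The following is a minimal, self-contained Scala sketch (not Spark itself; `LazyFileData` and the read counter are hypothetical stand-ins) illustrating the double-read effect:

```scala
// Simplified illustration of the review point: with a lazily evaluated,
// file-backed dataset, counting records up front and then processing them
// downstream scans the underlying data twice.
object DoubleReadSketch {
  var reads = 0 // tracks how many times the "files" are scanned

  // Stand-in for an uncached, file-backed RDD: every action re-reads the source.
  class LazyFileData(lines: => Seq[String]) {
    private def scan(): Seq[String] = { reads += 1; lines }
    def count(): Long = scan().size.toLong   // action #1: reads the files
    def collect(): Seq[String] = scan()      // action #2: reads them again
  }

  def main(args: Array[String]): Unit = {
    val rdd = new LazyFileData(Seq("a", "b", "c"))
    val recordCount = rdd.count() // eager count, as in the proposed patch
    val processed = rdd.collect() // downstream processing scans again
    println(s"records=$recordCount reads=$reads")
  }
}
```

Whether the count happens inline (as in the patch) or via the HDFS API on a separate thread, the input is still read once to count and once to process, which is the trade-off being questioned here.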