Github user devaraj-kavali commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22752#discussion_r226014153

    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
               listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
             }
    -        if (info.fileSize < entry.getLen()) {
    +        if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
    --- End diff --

    Thanks @steveloughran for looking into this.

    > Have you looked @ this getFileLength() call to see how well it updates?

    I looked at the DFSInputStream.getFileLength() API. It returns locatedBlocks.getFileLength() + lastBlockBeingWrittenLength, where locatedBlocks.getFileLength() is the length of all completed blocks as reported by the NameNode, and lastBlockBeingWrittenLength is the length of the last, still-being-written (not yet completed) block as reported by the DataNode.

    > FWIW HADOOP-15606 proposes adding a method like this for all streams

    Thanks for the pointer; once that is available we can update the code to use it.
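    To illustrate the idea only (this is not the PR's actual code), a minimal Scala sketch of such an absolute-length probe could look like the following; the helper name absoluteFileLength is hypothetical:

    ```scala
    import org.apache.hadoop.fs.{FileStatus, FileSystem}
    import org.apache.hadoop.hdfs.DFSInputStream

    // Hypothetical helper: ask the underlying DFSInputStream for the file length
    // that includes the last, still-being-written block, falling back to the
    // listing-reported length for non-HDFS streams.
    def absoluteFileLength(fs: FileSystem, status: FileStatus): Long = {
      val in = fs.open(status.getPath)
      try {
        in.getWrappedStream match {
          // DFSInputStream.getFileLength() = length of completed blocks (NameNode)
          // + length of the last block being written (DataNode).
          case dfsIn: DFSInputStream => dfsIn.getFileLength
          case _ => status.getLen
        }
      } finally {
        in.close()
      }
    }
    ```

    A check like checkAbsoluteLength could then compare this value against the fileSize recorded in the listing to decide whether the log grew even though entry.getLen() has not changed.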