Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/22752#discussion_r226243409
--- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -449,7 +450,7 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
          listing.write(info.copy(lastProcessed = newLastScanTime, fileSize = entry.getLen()))
        }
-        if (info.fileSize < entry.getLen()) {
+        if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
--- End diff --
...there's no timetable for that getLength thing, but if HDFS already
supports the API, I'm more motivated to implement it. It has benefits for cloud
stores in general:
1. It saves apps doing an up-front HEAD/getFileStatus() call just to learn how
long their data is; the GET should return it.
2. With S3 Select, you get back the filtered data, so you don't know how much
you will see until the GET is issued.
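
To make the new guard concrete, here is a rough sketch of the idea behind a
checkAbsoluteLength-style helper, assuming the HDFS API in question is
HdfsDataInputStream.getVisibleLength() (that is an assumption on my part; the
helper name, signature, and body below are illustrative, not the PR's actual
implementation):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem}
import org.apache.hadoop.hdfs.client.HdfsDataInputStream

// Hypothetical helper (name and signature are illustrative): returns true if
// the file's "visible" length, as seen by an open HDFS stream, is larger than
// the length the listing has recorded. HDFS only updates the NameNode-side
// length at block boundaries or on hflush/hsync, so getFileStatus() /
// entry.getLen() can lag behind what a reader can actually see.
def checkAbsoluteLength(
    fs: FileSystem,
    recordedFileSize: Long,
    entry: FileStatus): Boolean = {
  val in = fs.open(entry.getPath)
  try {
    in match {
      case hdfsIn: HdfsDataInputStream =>
        // Visible length: bytes readable right now, including data past the
        // last block boundary reported to the NameNode.
        hdfsIn.getVisibleLength > recordedFileSize
      case _ =>
        // Non-HDFS filesystems: no cheaper source of truth than getLen().
        false
    }
  } finally {
    in.close()
  }
}
```

Note the ordering in the diff's condition: because || short-circuits, the
expensive path (opening the file just to ask for its length, an extra round
trip per log on every scan) only runs when the cheap
info.fileSize < entry.getLen() comparison says nothing has changed.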