Github user StefanRRichter commented on a diff in the pull request:
https://github.com/apache/flink/pull/4019#discussion_r121630151
--- Diff:
flink-runtime/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopDataInputStream.java
---
@@ -31,11 +31,15 @@
*/
public final class HadoopDataInputStream extends FSDataInputStream {
+ /** Minimum amount of bytes to skip forward before we issue a seek
instead of discarding read */
+ private static final int MIN_SKIP_BYTES = 1024 * 1024;
--- End diff --
Right now, this is purely a "magic" number. The optimum depends on
the dfs and the underlying fs. For now, this number is chosen "big enough" to
provide improvements for smaller seeks, and "small enough" to avoid
disadvantages compared to real seeks. While the minimum should be the page
size, a true optimum per system would be the number of bytes that can be
consumed within the seek time. Unfortunately, seek time is not constant, and
devices as well as the dfs potentially also use read buffers and read-ahead.
In the long run this value could become configurable, but for now I have
simply chosen a conservative, relatively small value that should bring safe
improvements for small skips in metadata, which would hurt the most.
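
To make the trade-off concrete, here is a minimal sketch of the skip-or-seek decision and of the "bytes consumable within seek time" estimate. It is an illustration under assumptions, not the code from this pull request: the class name, the helper names (shouldSkipForward, skipFully, breakEvenBytes), and the example throughput and seek-time figures are all hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch; names and numbers are assumptions, not Flink's API.
final class SkipOrSeekSketch {

    /** Same value as the constant in the diff above. */
    static final int MIN_SKIP_BYTES = 1024 * 1024;

    /**
     * Decides whether a forward move should be served by reading and
     * discarding bytes instead of issuing a real seek.
     * Backward moves and large forward moves fall through to a seek.
     */
    static boolean shouldSkipForward(long currentPos, long targetPos, long minSkipBytes) {
        long delta = targetPos - currentPos;
        return delta > 0 && delta <= minSkipBytes;
    }

    /** Discards exactly {@code bytes} bytes, tolerating short skips. */
    static void skipFully(InputStream in, long bytes) throws IOException {
        while (bytes > 0) {
            long skipped = in.skip(bytes);
            if (skipped <= 0) {
                // skip() may legally return 0; read one byte to detect EOF.
                if (in.read() < 0) {
                    throw new EOFException("Stream ended before skip completed");
                }
                skipped = 1;
            }
            bytes -= skipped;
        }
    }

    /**
     * Rough break-even estimate: bytes that can be read sequentially in
     * the time one seek takes. Both inputs are assumed measurements.
     */
    static long breakEvenBytes(long throughputBytesPerSecond, long seekTimeMillis) {
        return throughputBytesPerSecond * seekTimeMillis / 1000;
    }
}
```

With an assumed sequential throughput of 100 MiB/s and an assumed seek time of 10 ms, breakEvenBytes gives 1 MiB, which happens to match the chosen constant. Real values vary with the device, the dfs, and any read-ahead, which is why the constant is only a conservative guess.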
---