[ 
https://issues.apache.org/jira/browse/FLINK-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047654#comment-16047654
 ] 

ASF GitHub Bot commented on FLINK-6776:
---------------------------------------

Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4019#discussion_r121630151
  
    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/fs/hdfs/HadoopDataInputStream.java ---
    @@ -31,11 +31,15 @@
      */
     public final class HadoopDataInputStream extends FSDataInputStream {
     
    +   /** Minimum amount of bytes to skip forward before we issue a seek instead of discarding read */
    +   private static final int MIN_SKIP_BYTES = 1024 * 1024;
    --- End diff --
    
    Right now, this is purely a "magic" number. The optimum depends on the DFS
    and the underlying file system. For now, this number is chosen "big enough"
    to provide improvements for smaller seeks, and "small enough" to avoid
    disadvantages compared to real seeks. While the minimum should be the page
    size, a true optimum per system would be the number of bytes that can be
    consumed within the seek time. Unfortunately, seek time is not constant, and
    devices as well as the DFS may also use read buffers and read-ahead. In the
    long run this value could become configurable, but for now I have simply
    chosen a conservative, relatively small value that should bring safe
    improvements for small skips in metadata, where seeks would hurt the most.
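
The trade-off described above can be sketched roughly as follows. This is a
minimal illustration, not the code in this pull request: `Seekable`,
`CountingStream`, and `seekOrSkip` are hypothetical names and do not mirror
Hadoop's or Flink's stream APIs. Only `MIN_SKIP_BYTES` and its value are taken
from the diff.

```java
import java.io.IOException;

/** Hypothetical sketch: skip over small forward gaps, seek for everything else. */
final class SkipOrSeekSketch {

    /** Threshold from the diff: forward gaps up to this size are skipped. */
    static final int MIN_SKIP_BYTES = 1024 * 1024;

    /** Minimal stand-in for a seekable stream (an assumption, not a real API). */
    interface Seekable {
        long getPos() throws IOException;
        void seek(long pos) throws IOException;
        long skip(long n) throws IOException;
    }

    /** Repositions the stream, preferring skip for small forward moves. */
    static void seekOrSkip(Seekable in, long target) throws IOException {
        long delta = target - in.getPos();
        if (delta > 0 && delta <= MIN_SKIP_BYTES) {
            long remaining = delta;
            while (remaining > 0) {
                // skip may advance by fewer bytes than requested, so loop
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    // no progress: fall back to a real seek
                    in.seek(target);
                    return;
                }
                remaining -= skipped;
            }
        } else if (delta != 0) {
            // backward moves and large forward gaps use a real seek
            in.seek(target);
        }
    }

    /** In-memory stream that counts calls, used only to exercise the sketch. */
    static final class CountingStream implements Seekable {
        long pos;
        int seeks;
        int skips;

        @Override
        public long getPos() {
            return pos;
        }

        @Override
        public void seek(long newPos) {
            seeks++;
            pos = newPos;
        }

        @Override
        public long skip(long n) {
            // advance by at most 4096 bytes per call to mimic partial skips
            long step = Math.min(n, 4096);
            skips++;
            pos += step;
            return step;
        }
    }
}
```

With this sketch, a forward move of 100 bytes costs one `skip` call and no
`seek`, while a 2 MiB forward move or any backward move falls through to
`seek`.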


> Use skip instead of seek for small forward repositioning in DFS streams
> -----------------------------------------------------------------------
>
>                 Key: FLINK-6776
>                 URL: https://issues.apache.org/jira/browse/FLINK-6776
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>            Priority: Minor
>
> Reading checkpoint metadata and finding key-groups in restores sometimes
> requires seeking in input streams. Currently, we always use a seek, even for
> small position changes. As small true seeks are far more expensive than small
> reads/skips, we should just skip over small gaps instead of performing the
> seek.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
