1996fanrui commented on pull request #13885:
URL: https://github.com/apache/flink/pull/13885#issuecomment-724424008


   > Thanks a lot for that deep diagnosis.
   > 
   > Given that result, would it make sense to buffer all Hadoop streams in 
Flink (also affecting S3, etc.), or just the DFS streams (HDFS)? Do all streams 
have such high statistics/metrics costs, or just the HDFS input stream?
   > 
   > In any case, it looks like we should add a `BufferedHadoopDataInputStream` 
that is a slight modification of the `HadoopDataInputStream` class with an 
internal buffer, and use that in the `HadoopFileSystem` when returning a 
stream. That way we should cover all cases.
   
   It may make sense to buffer most of the streams in Flink.
   
   First, adding a buffer to `HadoopDataInputStream` is beneficial. 
Second, `LocalDataInputStream` has no buffer either, so adding one there should 
also improve IO performance, which would benefit local recovery.
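   
   To illustrate why a buffer helps, here is a rough sketch, not the actual change in this PR. `BufferedReadSketch`, `CountingStream` and `BufferedStream` are hypothetical names made up for the example, and `CountingStream` only stands in for a stream whose every call is expensive. A real wrapper around `HadoopDataInputStream` or `LocalDataInputStream` would also have to handle `seek` and `getPos` and invalidate the buffer accordingly, which this sketch leaves out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch: none of these classes exist in Flink or Hadoop.
final class BufferedReadSketch {

    /** Stand-in for an expensive stream; counts every call that reaches it. */
    static final class CountingStream extends InputStream {
        private final InputStream in;
        int calls;

        CountingStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Minimal read buffer: single-byte reads are served from memory. */
    static final class BufferedStream extends InputStream {
        private final InputStream in;
        private final byte[] buf;
        private int pos;
        private int limit;

        BufferedStream(InputStream in, int bufferSize) {
            this.in = in;
            this.buf = new byte[bufferSize];
        }

        @Override
        public int read() throws IOException {
            if (pos >= limit) {
                // Buffer exhausted: refill with one bulk read on the wrapped stream.
                limit = in.read(buf, 0, buf.length);
                pos = 0;
                if (limit <= 0) {
                    limit = 0;
                    return -1; // end of stream
                }
            }
            return buf[pos++] & 0xFF;
        }
    }

    /** Reads the stream byte by byte until it ends. */
    static int drain(InputStream s) throws IOException {
        int n = 0;
        while (s.read() != -1) {
            n++;
        }
        return n;
    }

    /** Calls reaching the wrapped stream when reading byte by byte without a buffer. */
    static int unbufferedCalls(int dataSize) {
        CountingStream raw = new CountingStream(new ByteArrayInputStream(new byte[dataSize]));
        try {
            drain(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.calls;
    }

    /** Calls reaching the wrapped stream when the same reads go through the buffer. */
    static int bufferedCalls(int dataSize, int bufferSize) {
        CountingStream raw = new CountingStream(new ByteArrayInputStream(new byte[dataSize]));
        try {
            drain(new BufferedStream(raw, bufferSize));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered calls: " + unbufferedCalls(10_000));
        System.out.println("buffered calls:   " + bufferedCalls(10_000, 4096));
    }
}
```

   In this toy setup, reading 10,000 bytes one at a time hits the wrapped stream 10,001 times without the buffer and 4 times with a 4096-byte buffer. If each call carries per-call overhead such as statistics updates, that overhead shrinks by the same factor.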
   
   As for S3, I don't know much about it and I don't have an S3 environment to 
test with, so I can't say.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

