StephanEwen commented on pull request #13885:
URL: https://github.com/apache/flink/pull/13885#issuecomment-724230980


   Thanks a lot for that deep diagnosis.
   
   Given that result, would it make sense to buffer all Hadoop streams in Flink 
(which would also affect S3, etc.), or just the DFS streams (HDFS)? Do all streams 
incur such high statistics/metrics overhead, or just the HDFS input stream?
   
   In any case, it looks like we should add a `BufferedHadoopDataInputStream` 
that is a slight modification of the `HadoopDataInputStream` class with an 
internal buffer, and use that in the `HadoopFileSystem` when returning a 
stream. That way we should cover all cases.
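
A minimal sketch of what such a buffered wrapper could look like, assuming the goal is to serve small reads from an internal byte array so that the wrapped stream (and its per-call statistics bookkeeping) is touched only once per buffer refill. The names `SeekableSource` and `BufferedSeekableInputStream` are hypothetical stand-ins chosen to keep the example free of Hadoop and Flink dependencies; an actual `BufferedHadoopDataInputStream` would presumably wrap Hadoop's `FSDataInputStream` and extend Flink's `FSDataInputStream` instead.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for the seekable Hadoop stream being wrapped. */
interface SeekableSource {
    /** Reads up to len bytes into dst; returns -1 at end of stream. */
    int read(byte[] dst, int off, int len) throws IOException;

    void seek(long pos) throws IOException;

    long getPos() throws IOException;
}

/** Sketch of a buffered, seekable input stream (not the actual Flink class). */
class BufferedSeekableInputStream extends InputStream {
    private final SeekableSource in;
    private final byte[] buf;
    private int pos;   // index of the next byte to hand out from buf
    private int limit; // number of valid bytes currently in buf

    BufferedSeekableInputStream(SeekableSource in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.in = in;
        this.buf = new byte[bufferSize];
    }

    @Override
    public int read() throws IOException {
        // Only touch the wrapped stream when the buffer is exhausted.
        if (pos >= limit && !fill()) {
            return -1;
        }
        return buf[pos++] & 0xFF;
    }

    @Override
    public int read(byte[] dst, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= limit && !fill()) {
            return -1;
        }
        int n = Math.min(len, limit - pos);
        System.arraycopy(buf, pos, dst, off, n);
        pos += n;
        return n;
    }

    /** Seeks within the buffer when possible, otherwise on the wrapped stream. */
    public void seek(long target) throws IOException {
        long bufStart = in.getPos() - limit; // stream offset of buf[0]
        if (target >= bufStart && target < bufStart + limit) {
            pos = (int) (target - bufStart);
        } else {
            in.seek(target);
            pos = 0;
            limit = 0; // discard buffered bytes; next read refills
        }
    }

    /** Logical position: wrapped position minus the bytes not yet consumed. */
    public long getPos() throws IOException {
        return in.getPos() - (limit - pos);
    }

    /** Refills the buffer; returns false at end of stream. */
    private boolean fill() throws IOException {
        int n;
        do {
            n = in.read(buf, 0, buf.length);
        } while (n == 0);
        pos = 0;
        limit = Math.max(n, 0);
        return n > 0;
    }
}
```

With a buffer of a few kilobytes, a loop of single-byte `read()` calls reaches the wrapped stream only once per refill rather than once per byte. Whether that removes the overhead found in the diagnosis for non-HDFS streams as well is the open question raised above.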



