StephanEwen commented on pull request #13885: URL: https://github.com/apache/flink/pull/13885#issuecomment-724230980
Thanks a lot for that deep diagnosis. Given that result, would it make sense to buffer all Hadoop streams in Flink (which would also affect S3, etc.), or only the DFS (HDFS) streams? Do all streams incur such high statistics/metrics costs, or only the HDFS input stream?

In any case, it looks like we should add a `BufferedHadoopDataInputStream`, a slight modification of the `HadoopDataInputStream` class with an internal buffer, and have `HadoopFileSystem` use it when returning a stream. That way we should cover all cases.
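To make the buffering idea concrete, here is a rough, dependency-free sketch of what such a wrapper might look like. It is not the actual Flink or Hadoop code: `SeekableSource`, `CountingSource`, and `BufferedSeekableInputStream` are hypothetical names used only for illustration, with `SeekableSource` standing in for the wrapped Hadoop stream. The point is that many small reads are served from an in-memory buffer, so the underlying stream (and whatever per-call statistics it maintains) is touched only once per refill.

```java
import java.io.IOException;
import java.io.InputStream;

public class BufferedStreamSketch {

    /** Hypothetical stand-in for the wrapped Hadoop stream; not the real Hadoop API. */
    interface SeekableSource {
        int read(byte[] dst, int off, int len) throws IOException;
        void seek(long pos) throws IOException;
    }

    /** In-memory source that counts how often it is read (illustration only). */
    static final class CountingSource implements SeekableSource {
        private final byte[] data;
        private int pos;
        int readCalls;

        CountingSource(byte[] data) { this.data = data; }

        @Override
        public int read(byte[] dst, int off, int len) {
            readCalls++;
            if (len == 0) return 0;
            if (pos >= data.length) return -1;
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, dst, off, n);
            pos += n;
            return n;
        }

        @Override
        public void seek(long newPos) throws IOException {
            if (newPos < 0 || newPos > data.length) {
                throw new IOException("seek out of range: " + newPos);
            }
            pos = (int) newPos;
        }
    }

    /** Buffered wrapper. Invariant: the source is positioned at bufferStart + limit. */
    static final class BufferedSeekableInputStream extends InputStream {
        private final SeekableSource source;
        private final byte[] buffer;
        private long bufferStart; // source offset of buffer[0]
        private int limit;        // number of valid bytes in the buffer
        private int index;        // next byte to hand out

        BufferedSeekableInputStream(SeekableSource source, int bufferSize) {
            if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be > 0");
            this.source = source;
            this.buffer = new byte[bufferSize];
        }

        @Override
        public int read() throws IOException {
            if (index >= limit && !refill()) return -1;
            return buffer[index++] & 0xFF;
        }

        @Override
        public int read(byte[] dst, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (index >= limit && !refill()) return -1;
            int n = Math.min(len, limit - index);
            System.arraycopy(buffer, index, dst, off, n);
            index += n;
            return n;
        }

        public long getPos() { return bufferStart + index; }

        public void seek(long pos) throws IOException {
            if (pos >= bufferStart && pos <= bufferStart + limit) {
                index = (int) (pos - bufferStart); // target is already buffered
            } else {
                source.seek(pos);                  // discard buffer and reposition
                bufferStart = pos;
                limit = 0;
                index = 0;
            }
        }

        private boolean refill() throws IOException {
            bufferStart += limit;
            index = 0;
            limit = 0;
            int n = source.read(buffer, 0, buffer.length);
            if (n <= 0) return false;
            limit = n;
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        CountingSource source = new CountingSource(data);
        BufferedSeekableInputStream in = new BufferedSeekableInputStream(source, 256);
        int bytesRead = 0;
        while (in.read() != -1) bytesRead++;
        System.out.println(bytesRead + " single-byte reads, "
                + source.readCalls + " calls to the underlying source");
    }
}
```

Whether a seek that lands inside the current buffer should be served from memory, as done here, or always forwarded to the underlying stream is a design choice that would need checking against how Flink's readers actually seek.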
