1996fanrui commented on pull request #13885: URL: https://github.com/apache/flink/pull/13885#issuecomment-724092762
> Are the HDFS input streams generally not buffered? Would it make sense to adjust the `HadoopDataInputStream` class to be buffered?

Hi @StephanEwen, sorry for the late reply. Over the past two days I read the relevant HDFS read-path code and did some performance analysis. The conclusions are as follows:

The HDFS input stream is buffered by default, with a default buffer size of 64KB. In theory, wrapping hdfsInputStream in FSDataBufferedInputStream should not reduce the number of disk accesses. The measurements, however, show that with the extra buffer the restore time drops to one third of the original.

So I profiled CPU usage. Adding the buffer reduces it significantly: reading through hdfsInputStream directly, CPU usage is 60~70%; wrapping hdfsInputStream in FSDataBufferedInputStream, CPU usage is 20~25%.

Why does hdfsInputStream consume so much CPU? The HDFS client maintains a lot of statistics and has a relatively deep method call stack, and each call adds a small overhead. See the annotated flame graph: [flame graph remark link](https://drive.google.com/file/d/1zTwHdmSybAgyBGIIP71FLfMQ5DvK1eON/view?usp=sharing)

Wrapping hdfsInputStream in FSDataBufferedInputStream places a buffer outside the HDFS client, so most reads avoid the deep call stack and the HDFS statistics bookkeeping.

Original flame graph links: [restore with buffer](https://drive.google.com/file/d/1jxBNzh2iIsrX__wfFjvTWiC9LPCCDS3A/view?usp=sharing), [restore without buffer](https://drive.google.com/file/d/1qDbqQC4bG34_ZsCrOpDaNouNQpvVXRWH/view?usp=sharing)
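For illustration, below is a minimal sketch of the kind of read-ahead wrapper discussed above, built on Flink's `org.apache.flink.core.fs.FSDataInputStream`. The class name and buffer handling are hypothetical and are not the actual implementation in this PR; the point is only that small `read()` calls are served from a local array instead of crossing into the HDFS client each time.

```java
import java.io.IOException;
import org.apache.flink.core.fs.FSDataInputStream;

/**
 * Hypothetical sketch of a buffered wrapper around an FSDataInputStream.
 * Not the implementation in this PR; names and details are illustrative.
 */
public class BufferedFSDataInputStreamSketch extends FSDataInputStream {

    private final FSDataInputStream in;
    private final byte[] buffer;
    private int count; // number of valid bytes in the buffer
    private int pos;   // index of the next byte to return from the buffer

    public BufferedFSDataInputStreamSketch(FSDataInputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    @Override
    public int read() throws IOException {
        if (pos >= count) {
            // Refill with one bulk read into the underlying (HDFS) stream.
            int n = in.read(buffer, 0, buffer.length);
            if (n <= 0) {
                return -1; // end of stream
            }
            count = n;
            pos = 0;
        }
        return buffer[pos++] & 0xFF;
    }

    @Override
    public void seek(long desired) throws IOException {
        // Simplest possible behavior: drop the buffer and seek the wrapped stream.
        // A real implementation would also seek within the buffer when possible
        // and override read(byte[], int, int).
        pos = 0;
        count = 0;
        in.seek(desired);
    }

    @Override
    public long getPos() throws IOException {
        // Underlying position minus the bytes still sitting in our buffer.
        return in.getPos() - (count - pos);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

With a 64KB (or larger) wrapper buffer, most single-byte and small reads never reach the HDFS client, which matches the CPU-usage difference visible in the flame graphs above.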
