1996fanrui commented on pull request #13885:
URL: https://github.com/apache/flink/pull/13885#issuecomment-724092762


   > Are the HDFS input streams generally not buffered? Would it make sense to 
adjust the `HadoopDataInputStream` class to be buffered?
   
   Hi @StephanEwen, sorry for the late reply.
   Over the past two days I read the HDFS client code for reading data and did 
some performance analysis.
   My conclusions are below.
   The HDFS input stream is buffered by default. The default buffer size is 64KB.
   In theory, wrapping hdfsInputStream in FSDataBufferedInputStream should not 
reduce the number of disk accesses. In practice, however, adding the buffer 
reduced the restore time to one-third of the original.
   
   To find out why, I profiled both variants.
   In terms of CPU usage, adding the buffer greatly reduces the load:
   Using hdfsInputStream directly, CPU usage is 60~70%.
   Wrapping hdfsInputStream in FSDataBufferedInputStream, CPU usage is 
20~25%.
   
   Why hdfsInputStream consumes so much CPU:
   The HDFS client maintains a lot of statistics, and its method call stack is 
relatively deep. Each method call costs a little, and that cost adds up when 
every read goes through the client. This is shown in the picture below:
   [annotated flame 
graph](https://drive.google.com/file/d/1zTwHdmSybAgyBGIIP71FLfMQ5DvK1eON/view?usp=sharing)
   
   If hdfsInputStream is wrapped in FSDataBufferedInputStream, there is a 
buffer outside the HDFS client, so most reads are served from that buffer and 
avoid the very deep call stack and most of the HDFS statistics updates.
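
   The effect can be illustrated with a small, self-contained sketch. It is not the code from this PR: `java.io.BufferedInputStream` stands in for FSDataBufferedInputStream, a `ByteArrayInputStream` stands in for the HDFS stream, and `CountingInputStream` and `callsForSingleByteReads` are hypothetical names introduced only for this example. It counts how many calls reach the wrapped stream when the caller reads one byte at a time, with and without an outer buffer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BufferedReadSketch {

    /** Hypothetical stand-in for the HDFS stream: counts every call that reaches it. */
    static final class CountingInputStream extends FilterInputStream {
        long calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /**
     * Reads totalBytes one byte at a time and returns how many calls reached
     * the wrapped stream. A bufferSize of 0 means "no outer buffer".
     */
    static long callsForSingleByteReads(int totalBytes, int bufferSize) {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(new byte[totalBytes]));
        InputStream in =
                bufferSize > 0 ? new BufferedInputStream(counting, bufferSize) : counting;
        try {
            while (in.read() != -1) {
                // Consume the stream; only the call count matters here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counting.calls;
    }

    public static void main(String[] args) {
        int total = 1 << 20; // 1 MiB of dummy data
        System.out.println("calls without outer buffer: " + callsForSingleByteReads(total, 0));
        System.out.println("calls with 64KB outer buffer: " + callsForSingleByteReads(total, 64 * 1024));
    }
}
```

   Without the outer buffer, every small read goes through the wrapped stream; with it, the wrapped stream is only called when the outer buffer needs refilling. This only models the call-count side of the argument and says nothing about the actual cost per call inside the HDFS client.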
   
   Original flame graphs:
   [restore with 
buffer](https://drive.google.com/file/d/1jxBNzh2iIsrX__wfFjvTWiC9LPCCDS3A/view?usp=sharing)
   [restore without 
buffer](https://drive.google.com/file/d/1qDbqQC4bG34_ZsCrOpDaNouNQpvVXRWH/view?usp=sharing)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

