hudi-bot opened a new issue, #15911:
URL: https://github.com/apache/hudi/issues/15911

   The code currently performs an eager isCorruptedCheck, which does a seek 
followed by a read. This invalidates the internal buffers of the file stream 
opened on the log file and triggers a call to the DataNode to start a new 
blockReader.
   
   The cost of the seek + read becomes apparent on cross-datacenter reads, or 
wherever latency to the file is high. In such cases a single RPC costs about 
120 ms plus the cost of the RPC itself (west coast to east coast), so this 
seek is bad for performance.
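
   As a rough illustration of the arithmetic, the sketch below assumes each 
eager check adds exactly one extra round trip per log block. The class name, 
method name, and the block count of 100 are hypothetical; only the 120 ms 
figure comes from the description above.

```java
// Back-of-the-envelope model; names and the block count are illustrative assumptions.
final class SeekCostSketch {
    // Extra wall-clock time if every log block pays one additional round trip
    // for its eager corruption check.
    static long extraLatencyMillis(int logBlocks, long roundTripMillis) {
        return (long) logBlocks * roundTripMillis;
    }

    public static void main(String[] args) {
        // 120 ms is the per-RPC figure quoted above; 100 blocks is a made-up example.
        System.out.println(extraLatencyMillis(100, 120) + " ms"); // prints "12000 ms"
    }
}
```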
   
   Delaying the corruption check also brings benefits in low-latency 
environments, where we see read times drop from 5-8 s to between 3 s and 
under 500 ms for a moderately sized file of 250 MB.
   
   NOTE: The more log blocks there are to read, the greater the performance 
improvement.
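
   To make the per-block effect concrete, here is a minimal simulation, not 
the Hudi reader itself. It assumes a toy layout in which each block is stored 
as `[length, payload..., length]`, and it counts every non-sequential seek as 
one expensive operation. All class and method names are hypothetical.

```java
// Toy model only: none of these names come from the Hudi codebase.
final class LogStreamSketch {
    private final int[] words; // the "file": blocks laid out as [len, payload..., len]
    private int pos = 0;
    int seeks = 0;             // number of non-sequential position changes

    LogStreamSketch(int[] words) { this.words = words; }

    boolean atEnd() { return pos >= words.length; }

    int position() { return pos; }

    int read() {
        if (pos >= words.length) {
            throw new IllegalStateException("read past end of log");
        }
        return words[pos++];
    }

    void seek(int target) {
        if (target != pos) { // only a real jump counts as a seek
            seeks++;
            pos = target;
        }
    }
}

final class CorruptCheckSketch {
    // Builds a well-formed toy log with the given number of equally sized blocks.
    static int[] buildLog(int blocks, int payloadWords) {
        int[] log = new int[blocks * (payloadWords + 2)];
        int i = 0;
        for (int b = 0; b < blocks; b++) {
            log[i++] = payloadWords;                           // leading length
            for (int w = 0; w < payloadWords; w++) log[i++] = b; // payload
            log[i++] = payloadWords;                           // trailing length
        }
        return log;
    }

    // Reads every block and returns how many were read.
    // eager = true: jump ahead to verify the trailing length, then jump back.
    // eager = false: verify the trailing length only when it is reached sequentially.
    static int readAll(LogStreamSketch in, boolean eager) {
        int blocks = 0;
        while (!in.atEnd()) {
            int len = in.read();
            int payloadStart = in.position();
            if (eager) {
                in.seek(payloadStart + len);
                if (in.read() != len) {
                    throw new IllegalStateException("corrupt block " + blocks);
                }
                in.seek(payloadStart);
            }
            for (int i = 0; i < len; i++) in.read(); // consume payload in order
            if (in.read() != len) {
                throw new IllegalStateException("corrupt block " + blocks);
            }
            blocks++;
        }
        return blocks;
    }
}
```

   In this model the eager path costs two seeks per block, while the deferred 
path costs none, so the gap grows linearly with the number of log blocks.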
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-6116
   - Type: Improvement


