Prashant Wason created HUDI-6116:
------------------------------------

             Summary: Optimize log block reading by removing seeks to check 
corrupted blocks
                 Key: HUDI-6116
                 URL: https://issues.apache.org/jira/browse/HUDI-6116
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


The code currently does an eager isCorruptedCheck, for which we do a seek and 
then a read. This invalidates the internal buffers of the opened file stream 
to the log file and makes a call to the DataNode to start a new blockReader.

The cost of the seek + read becomes apparent when we do cross-datacenter reads 
or wherever the latency to the file is high. In such cases, a single RPC costs 
us about 120ms + the cost of the RPC (west coast to east coast), so this seek 
is bad for performance.

Delaying the corrupt check also brings large benefits in low-latency 
environments, where we see read times drop from (5 to 8 sec) to (3s to < 500ms) 
for a moderately sized file of 250MB.

NOTE: The more log blocks there are to read, the greater the performance 
improvement.
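
As a rough illustration of why the saving grows with the number of blocks, 
here is a minimal Java sketch of eager versus deferred corruption checking. It 
is a toy model, not Hudi's reader and not the actual patch: the class 
LazyCorruptCheckSketch, the CountingStream helper, the MAGIC value and the 
"MAGIC, length, payload" block layout are all assumptions made up for this 
example. The in-memory stream only counts seeks. In the scenario described 
above, each such seek would discard the stream's buffers and may cost a round 
trip to the DataNode.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Toy model of a block-structured log file; NOT Hudi's actual reader. */
public class LazyCorruptCheckSketch {
    static final int MAGIC = 0x48554449; // hypothetical 4-byte block marker

    /** In-memory stand-in for a remote file stream that counts seeks. */
    static final class CountingStream {
        private final ByteBuffer data;
        int seeks = 0;
        CountingStream(byte[] bytes) { this.data = ByteBuffer.wrap(bytes); }
        void seek(int pos) { seeks++; data.position(pos); }
        int position() { return data.position(); }
        int remaining() { return data.remaining(); }
        int readInt() { return data.getInt(); }
        byte[] readBytes(int n) { byte[] b = new byte[n]; data.get(b); return b; }
    }

    /** Writes each block as: MAGIC, payload length, payload (assumed layout). */
    static byte[] writeLog(List<byte[]> payloads) {
        int size = 0;
        for (byte[] p : payloads) size += 8 + p.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] p : payloads) out.putInt(MAGIC).putInt(p.length).put(p);
        return out.array();
    }

    /** Eager: jump to the block end to validate it, then jump back and read. */
    static List<byte[]> readEager(CountingStream in) {
        List<byte[]> blocks = new ArrayList<>();
        while (in.remaining() >= 8) {
            if (in.readInt() != MAGIC) break;
            int len = in.readInt();
            if (len < 0 || len > in.remaining()) break; // truncated block
            int payloadStart = in.position();
            in.seek(payloadStart + len);                // seek #1: look past the block
            boolean ok = in.remaining() == 0
                    || (in.remaining() >= 4 && in.readInt() == MAGIC);
            in.seek(payloadStart);                      // seek #2: come back
            if (!ok) break;
            blocks.add(in.readBytes(len));
        }
        return blocks;
    }

    /** Deferred: read sequentially; a bad marker is noticed on the next header read. */
    static List<byte[]> readLazy(CountingStream in) {
        List<byte[]> blocks = new ArrayList<>();
        while (in.remaining() >= 8) {
            if (in.readInt() != MAGIC) break;           // corruption surfaces here
            int len = in.readInt();
            if (len < 0 || len > in.remaining()) break; // truncated block
            blocks.add(in.readBytes(len));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<byte[]> payloads = new ArrayList<>();
        for (int i = 0; i < 100; i++) payloads.add(new byte[] {(byte) i, 1, 2, 3});
        byte[] log = writeLog(payloads);

        CountingStream eager = new CountingStream(log);
        CountingStream lazy = new CountingStream(log);
        System.out.println("eager: blocks=" + readEager(eager).size() + " seeks=" + eager.seeks);
        System.out.println("lazy:  blocks=" + readLazy(lazy).size() + " seeks=" + lazy.seeks);
    }
}
```

On an uncorrupted log both readers return the same blocks, but in this toy the 
eager reader issues two seeks per block while the deferred one issues none, so 
the gap widens linearly with the block count. The two readers are not 
equivalent on a damaged file: here the deferred reader still returns a block 
whose successor has a bad marker, which the eager reader would drop.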

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
