hudi-bot opened a new issue, #15911: URL: https://github.com/apache/hudi/issues/15911
The code currently performs an eager isCorruptedCheck, which does a seek followed by a read. This invalidates the internal buffers of the open file stream to the log file and triggers a call to the DataNode to start a new blockReader.

The cost of the seek + read becomes apparent on cross-datacenter reads, or wherever the latency to the file is high. In such cases, a single RPC costs about 120 ms plus the cost of the RPC (west coast to east coast), so this seek is bad for performance.

Delaying the corrupt check also helps in low-latency environments, where we see read times drop from 5 to 8 s to between 3 s and under 500 ms for moderately sized files of 250 MB.

NOTE: The more log blocks there are to read, the greater the performance improvement.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-6116
- Type: Improvement

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
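To illustrate the difference between the eager and the delayed corrupt check described in this issue, here is a minimal Python sketch. It is not Hudi's reader and does not use Hudi's log format: the block layout (`MAGIC`, a length, the payload, a trailing length), the `CountingStream` wrapper, and the `read_blocks` function are all hypothetical names invented for this example. Each counted seek stands in for the buffer invalidation and extra round trip that the issue attributes to the eager check.

```python
import io
import struct

MAGIC = b"#BLK"  # hypothetical block marker, not Hudi's actual magic bytes


class CountingStream:
    """Wraps a byte stream and counts seeks (a stand-in for costly round trips)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.seeks = 0

    def read(self, n):
        return self._buf.read(n)

    def seek(self, pos):
        self.seeks += 1
        self._buf.seek(pos)

    def tell(self):
        return self._buf.tell()


def write_blocks(payloads):
    """Encode each payload as MAGIC + length + payload + trailing length."""
    out = bytearray()
    for p in payloads:
        size = struct.pack(">I", len(p))
        out += MAGIC + size + p + size
    return bytes(out)


def _read_u32(stream):
    """Read a big-endian 4-byte integer, or fail if the stream is truncated."""
    raw = stream.read(4)
    if len(raw) != 4:
        raise ValueError("corrupt block: truncated length field")
    return struct.unpack(">I", raw)[0]


def read_blocks(stream, eager):
    """Return all payloads; raise ValueError when a block is corrupt."""
    payloads = []
    while True:
        magic = stream.read(4)
        if not magic:
            return payloads  # clean end of file
        if magic != MAGIC:
            raise ValueError("corrupt block: bad magic")
        size = _read_u32(stream)
        if eager:
            # Eager check: jump ahead to the trailer, verify it, jump back.
            start = stream.tell()
            stream.seek(start + size)
            if _read_u32(stream) != size:
                raise ValueError("corrupt block: length mismatch")
            stream.seek(start)
        payload = stream.read(size)
        # Delayed check: verify the trailer when the sequential read reaches it.
        # This runs in both modes and is the only check in delayed mode.
        if len(payload) != size or _read_u32(stream) != size:
            raise ValueError("corrupt block: length mismatch")
        payloads.append(payload)
```

In this model the eager reader performs two seeks per block, while the delayed reader performs none and still rejects a block whose trailing length does not match. The trade-off is that the delayed reader only notices corruption after it has already read the payload.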
