[
https://issues.apache.org/jira/browse/HUDI-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372973#comment-17372973
]
Rajesh Mahindra commented on HUDI-2118:
---------------------------------------
While benchmarking merges in Hudi on the read side, we found that
HoodieLogFileReader checks every block for corruption (partially written files
may contain incomplete log blocks that must be discarded).
Cloud stores such as S3 and GCS only support transactional writes, i.e., a
file is either written completely or not written at all (we verified this on
both S3 and GCS):
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html]
Hence, for such stores we can skip the isBlockCorrupt check (see code below)
when reading blocks in HoodieLogFileReader. Our benchmark results show that
this check can cost hundreds of milliseconds for larger file sizes.
boolean isCorrupted = isBlockCorrupt(blocksize);
if (isCorrupted) {
  return createCorruptBlock();
}
How to proceed: in StorageSchemes.java, add a new attribute for transactional
writes, set it to true for S3 and GCS, and to false for schemes such as HDFS
that use streaming writes.
Then, in the readBlock method of HoodieLogFileReader, skip the isBlockCorrupt
call when the storage scheme supports transactional writes.
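The two steps above could be sketched as follows. This is a hypothetical
illustration, not Hudi's actual code: the enum entries, the
supportsTransactionalWrites method, and the fs.getScheme()/fsScheme names in
the reader-side comment are assumptions for the sake of the example.

```java
import java.util.Arrays;

// Sketch of a StorageSchemes-style enum carrying a per-scheme flag that
// records whether the store guarantees all-or-nothing (transactional) writes.
public enum StorageSchemes {
  // HDFS streams bytes as it writes, so a crash can leave a partial file.
  HDFS("hdfs", false),
  // S3 and GCS commit an object atomically: it is fully visible or absent.
  S3A("s3a", true),
  GCS("gs", true);

  private final String scheme;
  private final boolean transactionalWrites;

  StorageSchemes(String scheme, boolean transactionalWrites) {
    this.scheme = scheme;
    this.transactionalWrites = transactionalWrites;
  }

  // True if the given URI scheme is known to support transactional writes.
  public static boolean supportsTransactionalWrites(String scheme) {
    return Arrays.stream(values())
        .anyMatch(s -> s.scheme.equals(scheme) && s.transactionalWrites);
  }

  // Reader side (sketch): in HoodieLogFileReader#readBlock, only pay the
  // corruption-check cost when writes are not atomic, e.g.:
  //
  //   if (!StorageSchemes.supportsTransactionalWrites(fsScheme)
  //       && isBlockCorrupt(blocksize)) {
  //     return createCorruptBlock();
  //   }
}
```

With this shape, adding support for another object store with atomic writes
is a one-line enum change, and the reader never needs scheme-specific logic.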
> Avoid checking corrupt log blocks for cloud storage
> ---------------------------------------------------
>
> Key: HUDI-2118
> URL: https://issues.apache.org/jira/browse/HUDI-2118
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)