[
https://issues.apache.org/jira/browse/HUDI-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17372973#comment-17372973
]
Rajesh Mahindra commented on HUDI-2118:
---------------------------------------
While benchmarking merges in Hudi on the read side, we found that
HoodieLogFileReader checks every block for corruption (partially written files
may contain incomplete log blocks that must be discarded).
Cloud stores such as S3 and GCS only support transactional writes, i.e., a
file is either written completely or not written at all (we verified this on
both S3 and GCS):
[https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html]
Hence, for such stores we can skip the isBlockCorrupt check (see code below)
when reading blocks in HoodieLogFileReader. Our benchmark results show that
this check can cost hundreds of milliseconds for larger file sizes.
boolean isCorrupted = isBlockCorrupt(blocksize);
if (isCorrupted) {
  return createCorruptBlock();
}
How to proceed: in StorageSchemes.java, add a new attribute for transactional
writes, set it to true for S3 and GCS, and to false for schemes such as HDFS
that use streaming writes.
Then, in the readBlock method of HoodieLogFileReader, skip the isBlockCorrupt
call when the storage scheme supports transactional writes.
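The two steps above could be sketched as follows. This is a hypothetical
illustration, not Hudi's actual code: the enum entries, the
supportsTransactionalWrites method, and the fs.getScheme()/fsScheme names in
the reader-side comment are assumptions for the sake of the example.

```java
import java.util.Arrays;

// Sketch of a StorageSchemes-style enum carrying a per-scheme flag that
// records whether the store guarantees all-or-nothing (transactional) writes.
public enum StorageSchemes {
  // HDFS streams bytes as it writes, so a crash can leave a partial file.
  HDFS("hdfs", false),
  // S3 and GCS commit an object atomically: it is fully visible or absent.
  S3A("s3a", true),
  GCS("gs", true);

  private final String scheme;
  private final boolean transactionalWrites;

  StorageSchemes(String scheme, boolean transactionalWrites) {
    this.scheme = scheme;
    this.transactionalWrites = transactionalWrites;
  }

  // True if the given URI scheme is known to support transactional writes.
  public static boolean supportsTransactionalWrites(String scheme) {
    return Arrays.stream(values())
        .anyMatch(s -> s.scheme.equals(scheme) && s.transactionalWrites);
  }

  // Reader side (sketch): in HoodieLogFileReader#readBlock, only pay the
  // corruption-check cost when writes are not atomic, e.g.:
  //
  //   if (!StorageSchemes.supportsTransactionalWrites(fsScheme)
  //       && isBlockCorrupt(blocksize)) {
  //     return createCorruptBlock();
  //   }
}
```

With this shape, adding support for another object store with atomic writes
is a one-line enum change, and the reader never needs scheme-specific logic.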
> Avoid checking corrupt log blocks for cloud storage
> ---------------------------------------------------
>
> Key: HUDI-2118
> URL: https://issues.apache.org/jira/browse/HUDI-2118
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Priority: Minor
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)