Dongwook Kwon created HUDI-993:
----------------------------------

             Summary: Use hoodie.delete.shuffle.parallelism for Delete API
                 Key: HUDI-993
                 URL: https://issues.apache.org/jira/browse/HUDI-993
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Performance
            Reporter: Dongwook Kwon


While HUDI-328 introduced the Delete API, I noticed the 
[deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
 method doesn't apply any parallelism to its RDD operation, whereas 
[deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
 for upsert does use parallelism on the RDD.

Also, "hoodie.delete.shuffle.parallelism" doesn't seem to be used anywhere.

 

I found that in certain cases, e.g. when the input RDD has low parallelism but 
the target table has large files, some Spark stages suffer from that low 
parallelism. In such cases, an upsert with "EmptyHoodieRecordPayload" is 
actually faster than the Delete API.

This is because "hoodie.combine.before.upsert" is true by default; if it were 
disabled, upsert would suffer from the same low-parallelism issue.

So I wonder whether the input RDD should be repartitioned to 
"hoodie.delete.shuffle.parallelism" for better performance, regardless of 
whether "hoodie.combine.before.delete" is enabled.
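To illustrate the proposal, here is a minimal, self-contained sketch (not Hudi's actual code; the class, method, and fallback behavior are hypothetical) of how the delete path could resolve the target parallelism from "hoodie.delete.shuffle.parallelism" before repartitioning the input keys RDD:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed logic: resolve the shuffle
// parallelism for the delete path from the write config, falling back
// to the input RDD's partition count when the property is unset.
public class DeleteParallelismSketch {

  static final String DELETE_PARALLELISM_KEY = "hoodie.delete.shuffle.parallelism";

  // Returns the parallelism the delete path should repartition to.
  static int deleteParallelism(Map<String, String> config, int inputPartitions) {
    String configured = config.get(DELETE_PARALLELISM_KEY);
    if (configured != null) {
      return Integer.parseInt(configured);
    }
    // No explicit config: keep the input RDD's existing layout.
    return inputPartitions;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put(DELETE_PARALLELISM_KEY, "200");

    // With the property set, the delete path would call something like
    // keys.repartition(200) before deduplication and record tagging,
    // instead of inheriting a possibly tiny input parallelism.
    System.out.println(deleteParallelism(config, 2));          // 200
    System.out.println(deleteParallelism(new HashMap<>(), 2)); // 2
  }
}
```

In the actual DeleteHelper, the resolved value would be passed to the keys RDD's repartition call so the dedup and lookup stages run with adequate parallelism even when the caller supplies a small input RDD.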


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
