[
https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar closed HUDI-993.
-------------------------------
Resolution: Fixed
> Use hoodie.delete.shuffle.parallelism for Delete API
> ----------------------------------------------------
>
> Key: HUDI-993
> URL: https://issues.apache.org/jira/browse/HUDI-993
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Performance
> Reporter: Dongwook Kwon
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.7.0
>
>
> While HUDI-328 introduced the Delete API, I noticed that the
> [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
> method doesn't apply any parallelism to its RDD operation, whereas
> [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
> for upsert does.
> Also, "hoodie.delete.shuffle.parallelism" doesn't seem to be used at all.
>
> I found that in certain cases, e.g. when the input RDD has few partitions but
> the target table has large files, the Spark job's performance suffers from the
> low parallelism; in those cases, an upsert with "EmptyHoodieRecordPayload" is
> faster than the Delete API.
> This is also related to "hoodie.combine.before.upsert" being true by default;
> even when it is disabled, the issue would be the same.
> So I wonder whether the input RDD should be repartitioned to
> "hoodie.delete.shuffle.parallelism" for better performance, regardless of the
> "hoodie.combine.before.delete" setting.
>
>
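A minimal sketch of the parallelism selection the issue proposes. The class and method names below are hypothetical (this is not Hudi's actual DeleteHelper code): it only models the suggested rule that the delete path should honor "hoodie.delete.shuffle.parallelism" whenever it is set, independently of "hoodie.combine.before.delete", and otherwise fall back to the input RDD's existing partition count.

```java
import java.util.Map;

// Hypothetical model of the proposed delete-path parallelism rule.
public class DeleteParallelismSketch {

    // Returns the shuffle parallelism the delete path should use, assuming the
    // configured value wins over the input RDD's (possibly low) partition count.
    static int deleteParallelism(Map<String, String> props, int inputPartitions) {
        int configured = Integer.parseInt(
                props.getOrDefault("hoodie.delete.shuffle.parallelism", "0"));
        // Honor the configured value whenever it is positive; otherwise keep
        // the input RDD's existing parallelism.
        return configured > 0 ? configured : inputPartitions;
    }

    public static void main(String[] args) {
        // Configured parallelism wins even for a low-parallelism input RDD.
        System.out.println(deleteParallelism(
                Map.of("hoodie.delete.shuffle.parallelism", "200"), 4));
        // Unset: fall back to the input partition count.
        System.out.println(deleteParallelism(Map.of(), 4));
    }
}
```

In the real code path this value would drive a repartition of the delete keys RDD before deduplication/tagging, so the downstream stages are no longer bound by the input's partition count.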
--
This message was sent by Atlassian Jira
(v8.3.4#803005)