Dongwook Kwon created HUDI-993:
----------------------------------
Summary: Use hoodie.delete.shuffle.parallelism for Delete API
Key: HUDI-993
URL: https://issues.apache.org/jira/browse/HUDI-993
Project: Apache Hudi
Issue Type: Improvement
Components: Performance
Reporter: Dongwook Kwon
While HUDI-328 introduced the Delete API, I noticed that the
[deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
method doesn't apply any parallelism to its RDD operation, while
[deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
for upsert does.
Also, {{hoodie.delete.shuffle.parallelism}} doesn't seem to be used anywhere.
I found that in certain cases, e.g. when the input RDD has low parallelism but
the target table has large files, Spark job performance suffers from that low
parallelism. In such cases, an upsert with {{EmptyHoodieRecordPayload}} is
faster than the Delete API.
This is because {{hoodie.combine.before.upsert}} is true by default; when it is
disabled, upsert would suffer from the same issue.
So I wonder whether the input RDD should be repartitioned to
{{hoodie.delete.shuffle.parallelism}}, even when {{hoodie.combine.before.delete}}
is false, so that the delete path gets better performance regardless of the
{{hoodie.combine.before.delete}} setting.
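A minimal sketch of the proposed behavior, modeled as a plain Java method since the real change would live in {{DeleteHelper}} against a Spark RDD. The method name {{targetParallelism}} and its parameters are hypothetical, purely for illustration: the idea is that the configured {{hoodie.delete.shuffle.parallelism}} would be honored whether or not dedup ({{hoodie.combine.before.delete}}) runs.

```java
// Hypothetical sketch of the partition-count decision for the delete path.
// Proposal: always honor hoodie.delete.shuffle.parallelism, regardless of
// whether hoodie.combine.before.delete (dedup) is enabled.
public class DeleteParallelismSketch {

    // Returns the partition count the keys RDD would be repartitioned to
    // before tagging/writing. combineBeforeDelete is accepted to show that
    // the proposed result does not depend on it.
    public static int targetParallelism(boolean combineBeforeDelete,
                                        int deleteShuffleParallelism,
                                        int inputPartitions) {
        if (deleteShuffleParallelism <= 0) {
            // No configured value: keep the input RDD's partitioning.
            return inputPartitions;
        }
        // Proposed behavior: use the configured parallelism whether or
        // not dedup runs first.
        return deleteShuffleParallelism;
    }

    public static void main(String[] args) {
        // Input RDD with only 2 partitions, config asks for 200.
        System.out.println(targetParallelism(false, 200, 2)); // prints 200
        System.out.println(targetParallelism(true, 200, 2));  // prints 200
        // Config unset (0): fall back to the input partitioning.
        System.out.println(targetParallelism(false, 0, 2));   // prints 2
    }
}
```

In the actual {{deduplicateKeys}} path this would amount to something like {{keys.repartition(parallelism)}} on the input RDD when dedup is skipped, mirroring what {{deduplicateRecords}} already does for upsert.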
--
This message was sent by Atlassian Jira
(v8.3.4#803005)