[ https://issues.apache.org/jira/browse/HUDI-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lin Liu resolved HUDI-9649.
---------------------------
> Avoid DAG retriggers in StreamSync (DeltaStreamer)
> --------------------------------------------------
>
> Key: HUDI-9649
> URL: https://issues.apache.org/jira/browse/HUDI-9649
> Project: Apache Hudi
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: Rajesh Mahindra
> Assignee: sivabalan narayanan
> Priority: Blocker
> Fix For: 1.1.0
>
>
> In StreamSync, before calling commit, the DAG is triggered in two places:
>     long totalErrorRecords =
>         writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
>     long totalRecords =
>         writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();
> Although we persist the write status, for large-scale ingestion batches (those
> that touch on the order of tens of thousands of files) these two actions can
> add 10-20 minutes to the commit time, since the RDD of write statuses is large.
> Implement callbacks or the like in the write client, invoked after it collects
> the write statuses (before committing). Consumers such as StreamSync can use
> such callbacks to apply logic based on the number of error records, etc.
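
The callback idea described above could look roughly like the following. This
is a minimal sketch in plain Java with no Hudi or Spark dependencies; the names
FileWriteStats, PreCommitCallback and commitWithCallback are hypothetical
stand-ins for illustration and are not Hudi APIs. It assumes the write client
already holds the per-file stats in driver memory at commit time, so both
totals can be computed in one in-memory pass instead of two RDD actions.

```java
import java.util.Arrays;
import java.util.List;

class WriteStatusCallbackSketch {

  /** Hypothetical, trimmed-down stand-in for per-file write stats held on the driver. */
  static final class FileWriteStats {
    final long totalRecords;
    final long totalErrorRecords;

    FileWriteStats(long totalRecords, long totalErrorRecords) {
      this.totalRecords = totalRecords;
      this.totalErrorRecords = totalErrorRecords;
    }
  }

  /** Hypothetical hook: called after stats are collected and before the commit. */
  @FunctionalInterface
  interface PreCommitCallback {
    boolean shouldCommit(List<FileWriteStats> collectedStats);
  }

  /** Sums both counters in a single pass: {totalRecords, totalErrorRecords}. */
  static long[] totals(List<FileWriteStats> stats) {
    long records = 0;
    long errors = 0;
    for (FileWriteStats s : stats) {
      records += s.totalRecords;
      errors += s.totalErrorRecords;
    }
    return new long[] {records, errors};
  }

  /** Hypothetical stand-in for the commit path; returns true if the commit went ahead. */
  static boolean commitWithCallback(List<FileWriteStats> stats, PreCommitCallback callback) {
    if (!callback.shouldCommit(stats)) {
      return false; // the consumer vetoed the commit, e.g. because of error records
    }
    // A real write client would finalize the commit here.
    return true;
  }

  public static void main(String[] args) {
    List<FileWriteStats> stats = Arrays.asList(
        new FileWriteStats(100, 0),
        new FileWriteStats(50, 2));

    // A consumer such as StreamSync could refuse to commit when any record failed.
    boolean committed = commitWithCallback(stats, collected -> totals(collected)[1] == 0);

    long[] t = totals(stats);
    System.out.println("records=" + t[0] + " errors=" + t[1] + " committed=" + committed);
  }
}
```

The design choice in this sketch is that the consumer never touches the RDD:
it only sees the already-collected stats, so its error-handling logic does not
trigger another Spark job.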
--
This message was sent by Atlassian Jira
(v8.20.10#820010)