[ 
https://issues.apache.org/jira/browse/HUDI-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Liu resolved HUDI-9649.
---------------------------

> Avoid DAG retriggers in StreamSync (DeltaStreamer)
> --------------------------------------------------
>
>                 Key: HUDI-9649
>                 URL: https://issues.apache.org/jira/browse/HUDI-9649
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: deltastreamer
>            Reporter: Rajesh Mahindra
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 1.1.0
>
>
> In StreamSync, before calling commit, the DAG gets triggered in two places
> (each sum() is a separate Spark action):
>
>   long totalErrorRecords =
>       writeStatusRDD.mapToDouble(WriteStatus::getTotalErrorRecords).sum().longValue();
>   long totalRecords =
>       writeStatusRDD.mapToDouble(WriteStatus::getTotalRecords).sum().longValue();
>
> Although we persist the write status, for large-scale ingestion batches (that
> touch on the order of tens of thousands of files) this can add 10-20 minutes
> to the commit time, since the RDD of write statuses is large.
>
> Implement callbacks or a similar mechanism in the write client, invoked after
> it collects the write statuses (before committing). Consumers such as
> StreamSync can use such callbacks to apply logic based on the number of error
> records, etc.
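
A minimal sketch of the callback idea described above, not the actual Hudi change. It assumes the write client has already collected the per-file write stats on the driver before committing, so the totals can be computed in one local pass instead of two RDD actions. All names here (PreCommitCallbackSketch, FileWriteStats, BatchTotals, PreCommitCallback, commitIfAllowed) are hypothetical and only illustrate the shape such a hook could take.

```java
import java.util.Arrays;
import java.util.List;

public class PreCommitCallbackSketch {

    /** Hypothetical stand-in for the per-file counters a write status carries. */
    static final class FileWriteStats {
        final String fileId;
        final long totalRecords;
        final long totalErrorRecords;

        FileWriteStats(String fileId, long totalRecords, long totalErrorRecords) {
            this.fileId = fileId;
            this.totalRecords = totalRecords;
            this.totalErrorRecords = totalErrorRecords;
        }
    }

    /** Totals for one ingestion batch, computed from already-collected stats. */
    static final class BatchTotals {
        final long totalRecords;
        final long totalErrorRecords;

        BatchTotals(long totalRecords, long totalErrorRecords) {
            this.totalRecords = totalRecords;
            this.totalErrorRecords = totalErrorRecords;
        }
    }

    /** Hypothetical hook: called after stats are collected, before the commit. */
    interface PreCommitCallback {
        boolean shouldCommit(BatchTotals totals);
    }

    /** Single local pass over the collected stats; no distributed job is run. */
    static BatchTotals summarize(List<FileWriteStats> collected) {
        long records = 0;
        long errors = 0;
        for (FileWriteStats stats : collected) {
            records += stats.totalRecords;
            errors += stats.totalErrorRecords;
        }
        return new BatchTotals(records, errors);
    }

    /** Returns the callback's decision; the commit itself is omitted from this sketch. */
    static boolean commitIfAllowed(List<FileWriteStats> collected, PreCommitCallback callback) {
        return callback.shouldCommit(summarize(collected));
    }

    public static void main(String[] args) {
        List<FileWriteStats> collected = Arrays.asList(
                new FileWriteStats("file-1", 1000, 0),
                new FileWriteStats("file-2", 500, 3));

        // Example consumer policy: refuse to commit if any record failed.
        PreCommitCallback noErrorsAllowed = totals -> totals.totalErrorRecords == 0;

        System.out.println("commit allowed = " + commitIfAllowed(collected, noErrorsAllowed));
    }
}
```

With a hook of this shape, a consumer such as StreamSync could supply its own error-handling policy as the callback rather than re-deriving the totals from the write status RDD.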



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
