[
https://issues.apache.org/jira/browse/HUDI-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yanjia Gary Li reopened HUDI-415:
---------------------------------
> HoodieSparkSqlWriter Commit time not representing the Spark job starting time
> -----------------------------------------------------------------------------
>
> Key: HUDI-415
> URL: https://issues.apache.org/jira/browse/HUDI-415
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Yanjia Gary Li
> Assignee: Yanjia Gary Li
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.1
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Hudi records the commit time only after the first Spark action completes. If a
> heavy transformation runs before isEmpty(), the recorded commit time can be
> later than the time the job actually started.
> {code:java}
> // isEmpty() is a Spark action, so any upstream transformation runs here
> if (hoodieRecords.isEmpty()) {
>   log.info("new batch has no new records, skipping...")
>   return (true, common.util.Option.empty())
> }
> // the commit time is only assigned after isEmpty() has returned
> commitTime = client.startCommit()
> writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords, commitTime, operation)
> {code}
> For example, if I start the Spark job at 201901010000 but *isEmpty()* runs for 2
> hours, the commit time in the .hoodie folder will be 201901010200. If I then
> use that commit time as the starting point for my next ingestion from HDFS
> (not using DeltaStreamer), I will miss 2 hours of data.
> Is this behavior intended? Can we obtain the commit time before calling isEmpty()?
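
The ordering problem in the report above can be reproduced without Spark or Hudi. The following is a minimal sketch, not Hudi code: `FakeClock`, `commitAfterCheck`, and `commitBeforeCheck` are hypothetical names introduced here for illustration, the clock is advanced by hand to stand in for the two hours spent inside isEmpty(), and the timestamp layout is assumed to be the minute-resolution one quoted in the example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CommitTimeOrdering {
    // Assumed layout, matching the 12-digit timestamps quoted in the report.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    /** Hypothetical clock that the simulated job advances as work is done. */
    static final class FakeClock {
        private LocalDateTime now;

        FakeClock(LocalDateTime start) {
            this.now = start;
        }

        void advance(Duration elapsed) {
            now = now.plus(elapsed);
        }

        String stamp() {
            return now.format(FORMAT);
        }
    }

    /** Reported order: the emptiness check runs first, then the commit time is taken. */
    static String commitAfterCheck(FakeClock clock, Duration checkCost) {
        clock.advance(checkCost); // stand-in for isEmpty() triggering the heavy transformation
        return clock.stamp();     // stand-in for client.startCommit()
    }

    /** Proposed order: the commit time is taken first, then the emptiness check runs. */
    static String commitBeforeCheck(FakeClock clock, Duration checkCost) {
        String commitTime = clock.stamp(); // captured at job start
        clock.advance(checkCost);          // the check no longer shifts the commit time
        return commitTime;
    }

    public static void main(String[] args) {
        LocalDateTime jobStart = LocalDateTime.of(2019, 1, 1, 0, 0);
        Duration checkCost = Duration.ofHours(2);

        System.out.println("after check:  " + commitAfterCheck(new FakeClock(jobStart), checkCost));
        System.out.println("before check: " + commitBeforeCheck(new FakeClock(jobStart), checkCost));
    }
}
```

With the values from the example, the first call yields 201901010200 and the second yields 201901010000. One trade-off to keep in mind: if startCommit() leaves any state behind on the timeline, taking the commit time first means the empty-batch path may have to clean that state up before returning.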
--
This message was sent by Atlassian Jira
(v8.3.4#803005)