[jira] [Created] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

Yanjia Gary Li (Jira) Mon, 16 Dec 2019 18:18:50 -0800

Yanjia Gary Li created HUDI-415:
-----------------------------------

             Summary: HoodieSparkSqlWriter Commit time not representing the 
Spark job starting time
                 Key: HUDI-415
                 URL: https://issues.apache.org/jira/browse/HUDI-415
             Project: Apache Hudi (incubating)
          Issue Type: Bug
            Reporter: Yanjia Gary Li
            Assignee: Yanjia Gary Li



Hudi records the commit time only after the first Spark action completes. If 
there is a heavy transformation before isEmpty(), the commit time can be much 
later than the time the Spark job actually started.
{code:java}
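// isEmpty() is the first action, so it triggers the upstream transformations.
if (hoodieRecords.isEmpty()) {
  log.info("new batch has no new records, skipping...")
  return (true, common.util.Option.empty())
}
// The commit time is only generated here, after isEmpty() has finished.
commitTime = client.startCommit()
writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords,
  commitTime, operation)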
{code}
For example, suppose I start the Spark job at 201901010000 and *isEmpty()* 
runs for 2 hours. The commit time in the .hoodie folder will then be 
201901010*2*00. If I use that commit time as the starting point for the next 
ingestion (reading from HDFS, not using DeltaStreamer), I will miss the 2 
hours of data between 201901010000 and 201901010200.

Is this behavior intended? Can we call startCommit() before isEmpty(), so that 
the commit time is generated before the first action runs?
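
To make the ordering question concrete, here is a minimal, self-contained 
sketch that simulates both orderings. It does not use any Hudi or Spark API: 
the class CommitTimeOrdering, the simulated clock `now`, and the stand-in 
methods startCommit() and isEmpty() are all hypothetical, and the timestamp 
format is chosen only to match the example above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CommitTimeOrdering {
    // Assumption: format picked to match the timestamps in the example, not taken from Hudi.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    // Hypothetical simulated wall clock, so the two-hour delay needs no real waiting.
    static LocalDateTime now;

    // Hypothetical stand-in for client.startCommit(): stamps the commit with the current time.
    static String startCommit() {
        return now.format(FMT);
    }

    // Hypothetical stand-in for hoodieRecords.isEmpty(): running the first action
    // costs as much time as the upstream transformation takes.
    static boolean isEmpty(Duration transformationCost) {
        now = now.plus(transformationCost);
        return false;
    }

    // Ordering from the snippet above: commit time is taken after isEmpty().
    static String commitAfterIsEmpty(LocalDateTime jobStart, Duration cost) {
        now = jobStart;
        if (isEmpty(cost)) {
            return null;
        }
        return startCommit();
    }

    // Proposed ordering: commit time is taken before isEmpty().
    static String commitBeforeIsEmpty(LocalDateTime jobStart, Duration cost) {
        now = jobStart;
        String commitTime = startCommit();
        if (isEmpty(cost)) {
            // Assumption: a real change would presumably also need to handle
            // the commit that was already started for an empty batch.
            return null;
        }
        return commitTime;
    }

    public static void main(String[] args) {
        LocalDateTime jobStart = LocalDateTime.of(2019, 1, 1, 0, 0);
        Duration cost = Duration.ofHours(2);
        System.out.println(commitAfterIsEmpty(jobStart, cost));  // 201901010200
        System.out.println(commitBeforeIsEmpty(jobStart, cost)); // 201901010000
    }
}
```

With the proposed ordering the commit time matches the job start, so a 
downstream reader that resumes from the commit time would not skip the two 
hours spent in the first action.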



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
