[
https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456010#comment-17456010
]
sivabalan narayanan commented on HUDI-1214:
-------------------------------------------
As I understand it, this is the ask:
Add the ability to serialize a checkpoint via Spark datasource writes, so
that if a user then starts up Deltastreamer, it automatically resumes from
the last known checkpoint.
Here is my take on this ask:
Deltastreamer reads through the Source interface, so each source has its own
way of determining checkpoints, and the checkpoint format differs from one
source to another. Spark datasource writers don't use any of this.
A typical use case could be to bootstrap data from a source folder using the
Spark datasource and then start a Deltastreamer with a Kafka source, so the
checkpoint formats of the two stages may also differ.
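To illustrate why the formats are not interchangeable, here is a minimal
Python sketch. The two formats shown (a Kafka-style
"topic,partition:offset,..." string and a DFS-style modification timestamp in
milliseconds) reflect my understanding of what those sources store and should
be treated as assumptions. The function names are hypothetical and are not
Hudi APIs.

```python
from typing import Dict, Tuple


def parse_kafka_checkpoint(checkpoint: str) -> Tuple[str, Dict[int, int]]:
    """Parse an assumed Kafka-style checkpoint: 'topic,partition:offset,...'."""
    topic, *pairs = checkpoint.split(",")
    offsets: Dict[int, int] = {}
    for pair in pairs:
        partition, offset = pair.split(":")
        offsets[int(partition)] = int(offset)
    return topic, offsets


def parse_dfs_checkpoint(checkpoint: str) -> int:
    """Parse an assumed DFS-style checkpoint: a timestamp in milliseconds."""
    return int(checkpoint)


if __name__ == "__main__":
    # Each parser accepts its own format.
    print(parse_kafka_checkpoint("events,0:100,1:250"))
    print(parse_dfs_checkpoint("1638921600000"))

    # A checkpoint written by one kind of source means nothing to the other.
    try:
        parse_dfs_checkpoint("events,0:100,1:250")
    except ValueError as err:
        print("not a DFS checkpoint:", err)
```

So a checkpoint recorded during a folder-based bootstrap would not, by
itself, tell a Kafka source where to resume.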
Anyway, as of today, the Spark datasource does not have a way to determine
checkpoints. I will close this ticket out, but please feel free to re-open it
if my understanding is wrong or if you have ideas on how to go about this.
> Need ability to set deltastreamer checkpoints when doing Spark datasource
> writes
> --------------------------------------------------------------------------------
>
> Key: HUDI-1214
> URL: https://issues.apache.org/jira/browse/HUDI-1214
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: Balaji Varadarajan
> Assignee: Trevorzhang
> Priority: Major
> Labels: sev:high, user-support-issues
> Fix For: 0.11.0
>
>
> Such support is needed for bootstrapping cases when users use spark write to
> do initial bootstrap and then subsequently use deltastreamer.
> DeltaStreamer manages checkpoints inside hoodie commit files and expects
> checkpoints in previously committed metadata. Users are expected to pass
> checkpoint or initial checkpoint provider when performing bootstrap through
> deltastreamer. Such support is not present when doing bootstrap using Spark
> Datasource.