openinx commented on pull request #1515: URL: https://github.com/apache/iceberg/pull/1515#issuecomment-708196162
We had a few discussions in our team. For the first question, how do we distinguish a batch job from a streaming job without checkpoint state: in the current Flink 1.11 there is no way to tell that a job is a batch job, so we can only extend the `BoundedOneInput` interface to do the Iceberg transaction commit. In theory, we should not break the big transaction into several small transactions in batch mode, because users expect the job to either commit successfully or roll back atomically. For now, we could set a property in the Iceberg Flink sink to indicate explicitly whether it is a batch or streaming job; a future Flink release will provide methods to accomplish this.

For the second question, at-least-once or at-most-once: if the Kafka source retains enough data for the Flink job to start from, then we don't lose any data from the source operator, so we have the at-least-once guarantee. To reduce duplication when recovering, I don't think there is a Flink interface for keeping the latest successfully consumed offset in the Iceberg sink; if someone really wants to do that, they could use the system timestamp or a user-defined field persisted in the Iceberg table properties.

For now, I totally agree with @rdblue that we should add a check that throws an exception, since Iceberg doesn't support streaming jobs with checkpointing disabled.
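The batch-mode commit idea above can be sketched in plain Java. This is only an illustration of the intent, not the actual Flink or Iceberg API: the class name `IcebergBatchCommitter`, the `collect` method, and the use of file-path strings are all hypothetical; `endInput()` mirrors the role of Flink's `BoundedOneInput#endInput()`, firing exactly one commit that covers every buffered file so the batch job stays atomic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the real Flink/Iceberg API): buffer every
// completed data file while a bounded input runs, then commit them all
// in one transaction when the input ends, instead of committing per
// checkpoint as a streaming job would.
class IcebergBatchCommitter {
    private final List<String> pendingFiles = new ArrayList<>();
    private boolean committed = false;

    // Called once per completed data file while the bounded input is running.
    void collect(String dataFilePath) {
        pendingFiles.add(dataFilePath);
    }

    // Mirrors BoundedOneInput#endInput(): a single all-or-nothing commit.
    // Returns the full set of files that would go into the one transaction.
    List<String> endInput() {
        if (committed) {
            throw new IllegalStateException("endInput() must only fire once");
        }
        committed = true;
        return new ArrayList<>(pendingFiles);
    }
}
```

The point of buffering until `endInput()` is that a failure before the end produces no partial commit: the table either gains all of the batch's files or none of them.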

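The check that @rdblue suggested could look roughly like the sketch below. The class and method names are made up for illustration; in a real sink builder the two flags would come from the job's execution mode and its `CheckpointConfig` rather than being passed in directly.

```java
// Hypothetical sketch of the validation the comment proposes: a streaming
// Iceberg sink cannot commit without checkpoints, so reject that
// configuration up front instead of silently never committing.
class IcebergSinkValidator {
    static void validate(boolean isStreaming, boolean checkpointingEnabled) {
        if (isStreaming && !checkpointingEnabled) {
            throw new IllegalArgumentException(
                "Iceberg sink does not support streaming jobs with "
                    + "checkpointing disabled; enable checkpointing or "
                    + "run in batch mode");
        }
        // Batch jobs commit once at endInput(), so they are fine either way.
    }
}
```

Failing fast here turns a confusing symptom (a streaming job that runs but never produces a snapshot) into an explicit configuration error at submission time.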