Hi Lakshmi,

+Steven Wu <[email protected]>, who wrote our Flink sink.
I can tell you a bit about how our sink works, which will hopefully help.

Ours accumulates data in data files until a Flink snapshot (checkpoint) is taken. When that happens, each writer closes its open data files and sends a DataFile instance for each one to the next stage, which is a single commit task responsible for committing all the data files to the Iceberg table. When the commit task gets the notification from each writer, it prepares a commit by writing a new manifest of all the data files, then stages that commit information in the checkpoint. When the checkpoint succeeds, the committer commits to the Iceberg table. If the Iceberg commit fails, the new manifests stack up and no data is lost.

When the committer runs the Iceberg commit, it checks which previous checkpoints have already been committed in recent Iceberg snapshots, using the Flink checkpoint ID stored in each Iceberg snapshot's summary. That check is what makes the commits exactly-once.
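To make that concrete, here's a rough sketch of the commit step against Iceberg's public API. This isn't our actual code: the class and the flink.checkpoint-id summary property name are made up for illustration, but newAppend, the snapshot summary, and the other Table calls are real Iceberg APIs.

import java.util.List;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

/** Commits the data files staged for one Flink checkpoint to the table. */
public class CheckpointCommitter {

  // hypothetical snapshot summary property holding the Flink checkpoint ID;
  // the key name here is invented for this sketch
  static final String FLINK_CHECKPOINT_ID = "flink.checkpoint-id";

  // a real job would load the table lazily on the task manager
  private final Table table;

  public CheckpointCommitter(Table table) {
    this.table = table;
  }

  /** Appends the staged files unless this checkpoint was already committed. */
  public void commit(long checkpointId, List<DataFile> stagedFiles) {
    table.refresh();
    if (checkpointId <= maxCommittedCheckpointId()) {
      return; // already in the table; skipping keeps retries exactly-once
    }

    AppendFiles append = table.newAppend();
    stagedFiles.forEach(append::appendFile);
    // record the checkpoint ID in the snapshot summary so later attempts
    // and restarted jobs can see what has already been committed
    append.set(FLINK_CHECKPOINT_ID, Long.toString(checkpointId));
    append.commit(); // writes the new manifest and swaps in the new snapshot
  }

  /** Walks back through recent snapshots for the last committed checkpoint ID. */
  private long maxCommittedCheckpointId() {
    Snapshot snapshot = table.currentSnapshot();
    while (snapshot != null) {
      String committed = snapshot.summary().get(FLINK_CHECKPOINT_ID);
      if (committed != null) {
        return Long.parseLong(committed);
      }
      Long parentId = snapshot.parentId();
      snapshot = parentId == null ? null : table.snapshot(parentId);
    }
    return Long.MIN_VALUE; // no Flink commits in this table yet
  }
}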
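And here's roughly how the commit task could be wired up as a Flink sink function, staging file groups per checkpoint and committing them once the checkpoint completes. Again, this is a simplified illustration, not our production code, and it glosses over committer serialization:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.iceberg.DataFile;

/**
 * Sketch of the single commit task. Upstream writers close their open data
 * files at each checkpoint barrier and emit one DataFile per closed file;
 * this sink stages them in checkpoint state and commits them to Iceberg
 * only after the checkpoint succeeds.
 */
public class IcebergFilesCommitter extends RichSinkFunction<DataFile>
    implements CheckpointedFunction, CheckpointListener {

  // from the previous sketch; assumed serializable for this illustration
  private final CheckpointCommitter committer;

  // files received since the last checkpoint barrier
  private transient List<DataFile> pendingFiles;
  // files staged per checkpoint ID, waiting for checkpoint completion
  private transient TreeMap<Long, List<DataFile>> stagedFiles;
  // Flink-managed copy of stagedFiles, carried in each checkpoint
  private transient ListState<TreeMap<Long, List<DataFile>>> checkpointState;

  public IcebergFilesCommitter(CheckpointCommitter committer) {
    this.committer = committer;
  }

  @Override
  public void invoke(DataFile file, Context context) {
    pendingFiles.add(file);
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    // stage this checkpoint's files; groups from earlier failed Iceberg
    // commits are still here and ride along in the state, so no data is lost
    stagedFiles.put(context.getCheckpointId(), new ArrayList<>(pendingFiles));
    pendingFiles.clear();

    checkpointState.clear();
    checkpointState.add(new TreeMap<>(stagedFiles));
  }

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    // the checkpoint succeeded, so every group up to this ID is durable
    // and safe to commit to the Iceberg table
    Map<Long, List<DataFile>> ready = stagedFiles.headMap(checkpointId, true);
    ready.forEach(committer::commit);
    ready.clear();
  }

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    pendingFiles = new ArrayList<>();
    stagedFiles = new TreeMap<>();
    checkpointState = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("staged-data-files",
            TypeInformation.of(new TypeHint<TreeMap<Long, List<DataFile>>>() {})));

    if (context.isRestored()) {
      // anything in the restored state belongs to a completed checkpoint,
      // so re-commit it; the snapshot-summary check in CheckpointCommitter
      // skips any checkpoint IDs that already made it into the table
      for (TreeMap<Long, List<DataFile>> staged : checkpointState.get()) {
        staged.forEach(committer::commit);
      }
    }
  }
}

The commit task has to run at parallelism 1 so a single task sees every writer's files, e.g. dataFileStream.addSink(new IcebergFilesCommitter(committer)).setParallelism(1).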
Steven can probably explain it better, but that's a rough outline.

rb

On Fri, Nov 15, 2019 at 11:53 AM Lakshmi Rao <[email protected]> wrote:

> Hi,
>
> I'm working on building a POC of streaming data with Flink to Iceberg for
> a hackathon project. I know this issue is still open:
> https://github.com/apache/incubator-iceberg/issues/567. I'm pretty excited
> for the work mentioned in the issue to be open sourced and would be happy
> to contribute to any tasks or tickets related to this issue!
>
> However, in the meantime, I'd like to get a simple working version of a
> flink-iceberg sink and generally explore Iceberg more. Any pointers on how
> to get started? I saw this PR that enabled sinking to Iceberg with Spark
> structured streaming: https://github.com/apache/incubator-iceberg/pull/228
> Are there any other pointers the community can provide?
>
> Thanks,
> Lakshmi

--
Ryan Blue
Software Engineer
Netflix