Hi Lakshmi,

+Steven Wu <[email protected]>, who wrote our Flink sink.

I can tell you a bit about how our sink works, which will hopefully help.
Ours accumulates data in data files until a Flink checkpoint. When that
happens, each writer closes its open data files and sends a DataFile
instance for each one to the next stage, which is a single commit task
responsible for committing all the data files to the Iceberg table. When
the commit task gets the notification from each writer, it prepares a
commit by writing a new manifest of all the data files, then stages that
commit information in the checkpoint. When the checkpoint succeeds, the
committer commits to the Iceberg table. If the Iceberg commit fails, the
new manifests will stack up and no data is lost. When the committer runs
the Iceberg commit, it checks which previous checkpoints have already been
committed by looking at recent Iceberg snapshots (each commit stores its
Flink checkpoint ID in the Iceberg snapshot summary), which makes the
commits exactly-once.
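
If it helps to see it in code, here's a very rough sketch of the committer
side using the Flink APIs. To be clear, this is not our actual code: the
class name and the "flink.checkpoint-id" summary property are made up,
table loading is omitted, it appends data files directly instead of staging
a manifest, and it only checks the latest snapshot rather than walking back
through recent ones.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

// Hypothetical single-parallelism committer task. Writers send it the
// DataFile instances for the files they closed at the last checkpoint.
public class IcebergCommitter extends RichSinkFunction<DataFile>
    implements CheckpointedFunction, CheckpointListener {

  // made-up summary property used to record the Flink checkpoint id
  private static final String CHECKPOINT_ID_PROP = "flink.checkpoint-id";

  private transient Table table; // assume this is loaded in open(), omitted
  private transient ListState<DataFile> stagedState;
  private final List<DataFile> pending = new ArrayList<>(); // not yet staged
  private final List<DataFile> staged = new ArrayList<>();  // in a checkpoint

  @Override
  public void invoke(DataFile file, Context context) {
    // buffer DataFiles from the writers until the next checkpoint
    pending.add(file);
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    // stage the commit in the checkpoint: if we fail after the checkpoint
    // succeeds but before the Iceberg commit, these files are restored
    staged.addAll(pending);
    pending.clear();
    stagedState.clear();
    stagedState.addAll(staged);
  }

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    // the checkpoint succeeded, so it is now safe to commit to Iceberg;
    // skip checkpoints that an earlier Iceberg commit already covered,
    // which is what makes the commits exactly-once across restarts
    if (staged.isEmpty() || alreadyCommitted(checkpointId)) {
      return;
    }

    AppendFiles append = table.newAppend();
    for (DataFile file : staged) {
      append.appendFile(file);
    }
    // record the checkpoint id in the snapshot summary for the check above
    append.set(CHECKPOINT_ID_PROP, String.valueOf(checkpointId));
    append.commit();
    staged.clear();
  }

  private boolean alreadyCommitted(long checkpointId) {
    Snapshot current = table.currentSnapshot();
    if (current == null) {
      return false;
    }
    String lastCommitted = current.summary().get(CHECKPOINT_ID_PROP);
    return lastCommitted != null && Long.parseLong(lastCommitted) >= checkpointId;
  }

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    stagedState = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("staged-data-files", DataFile.class));
    if (context.isRestored()) {
      // restore files that were staged but may not have been committed
      for (DataFile file : stagedState.get()) {
        staged.add(file);
      }
    }
  }
}

Again, just a sketch. The real flow writes a manifest at checkpoint time
and looks back through recent snapshots for the checkpoint ID, since other
commits can land on the table in between.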

Steven can probably explain it better, but that's a rough outline.

rb

On Fri, Nov 15, 2019 at 11:53 AM Lakshmi Rao <[email protected]> wrote:

> Hi,
>
> I'm working on building a POC of streaming data with Flink to Iceberg for
> a hackathon project. I know this issue is still open
> https://github.com/apache/incubator-iceberg/issues/567. I'm pretty
> excited for the work mentioned in the issue to be open sourced and would be
> happy to contribute to any tasks or tickets related to this issue!
>
> However, in the meantime, I'd like to get a simple working version of a
> Flink-Iceberg sink and generally explore Iceberg more. Any pointers on how
> to get started? I saw this PR that enabled sinking to Iceberg with Spark
> Structured Streaming: https://github.com/apache/incubator-iceberg/pull/228.
> Are there any other pointers the community can provide?
>
> Thanks
> Lakshmi
>


-- 
Ryan Blue
Software Engineer
Netflix
