Replies inline.

On Mon, May 6, 2019 at 3:01 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> I am also wondering whether it makes sense to have a config that limits
> the number of snapshots we want to track. This config could be based on
> the number of snapshots (e.g. keep only 10000 snapshots) or based on time
> (e.g. keep snapshots for the last 7 days). We could implement both,
> actually. AFAIK, the expiration of snapshots is manual right now. Would it
> make sense to control this via config options, or do we expect users to do
> this?

I'm reluctant to do this without an explicit call from the user or in a
service. The problem is when to expire snapshots. Iceberg is called
regularly to read and write tables. That might seem like a good time to
expire snapshots, but it doesn't make sense for either operation to have
the side effect of physically deleting data files and discarding metadata.
That goes beyond user expectations by performing destructive tasks. It also
changes the guarantees of those operations: reads should be as fast as
possible, and callers may rely on writes not performing additional work
that could cause failures.

> Spark provides queryId and epochId/batchId to all sinks, which must
> ensure that all writes are idempotent. Spark might try to commit the same
> batch multiple times. So, we need to know the latest committed batchId for
> every query. One option is to store this information in the table
> metadata. However, this breaks time travel and rollbacks. We need to have
> this mapping per snapshot. The snapshot summary seems like a reasonable
> choice. Would it make sense to do something similar to "total-records" and
> "total-files" to keep the latest committed batch id for each query? Any
> other ideas are welcome.

For Flink, we're creating a UUID for each checkpoint that writes files,
writing that UUID into the snapshot summary, and then checking whether a
known snapshot has that ID when the write resumes after a failure. That
sounds like what you're suggesting here, but using queryId/epochId as the
write ID.
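To make the idea concrete, here is a minimal sketch of that resume-time
check, using plain dicts to stand in for snapshot summaries. The summary
key name and function names are illustrative only, not Iceberg's actual
API:

```python
# Sketch of the idempotent-resume check: before committing a batch, scan
# known snapshot summaries for the (query_id, epoch_id) pair the writer is
# about to commit. Summary keys here are made up for illustration.

def already_committed(snapshots, query_id, epoch_id):
    """Return True if some snapshot summary already records this query's epoch."""
    key = f"spark.query.{query_id}.epoch-id"
    return any(summary.get(key) == str(epoch_id) for summary in snapshots)

def commit_batch(snapshots, query_id, epoch_id, files):
    """Append a new 'snapshot' unless this (query, epoch) was already written."""
    if already_committed(snapshots, query_id, epoch_id):
        return False  # Spark retried an already-committed batch; skip it
    snapshots.append({
        f"spark.query.{query_id}.epoch-id": str(epoch_id),
        "added-files": str(len(files)),
    })
    return True

snapshots = []
commit_batch(snapshots, "q1", 7, ["a.parquet"])  # first attempt commits
commit_batch(snapshots, "q1", 7, ["a.parquet"])  # retry is a no-op
```

A real implementation would read the summaries of the table's current
snapshot history instead of a list, but the check-before-commit shape is
the same for the Flink UUID and the Spark queryId/epochId variants.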
Sounds like a good plan to me.

rb

--
Ryan Blue
Software Engineer
Netflix