To get around this, my team expires snapshots based on the number of snapshots rather than by time. For example, if the reader job is scheduled to consume 2k snapshot increments, we have a cron job that retains the last 10k snapshots.
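
Roughly, the rule looks like the sketch below. This is plain Python for illustration, not the Iceberg API: `split_by_count` and `checkpoint_survives` are hypothetical helper names, and the snapshot IDs are assumed to be simple integers listed oldest first (real snapshot IDs are not ordered like this).

```python
from typing import List, Tuple


def split_by_count(snapshot_ids: List[int], retain_last: int) -> Tuple[List[int], List[int]]:
    """Split snapshot IDs (oldest first) into (expired, retained) by count.

    Hypothetical helper: keeps the newest `retain_last` snapshots no matter
    how old they are, which is the count-based rule described above.
    """
    if retain_last < 1:
        raise ValueError("retain_last must be at least 1")
    cut = max(len(snapshot_ids) - retain_last, 0)
    return snapshot_ids[:cut], snapshot_ids[cut:]


def checkpoint_survives(snapshot_ids: List[int], retain_last: int, checkpoint_id: int) -> bool:
    """Return True if the reader's stored checkpoint is still retained."""
    _, retained = split_by_count(snapshot_ids, retain_last)
    return checkpoint_id in retained


# Assumed numbers from the example: 12k snapshots in history, retain the last 10k.
history = list(range(1, 12001))
expired, retained = split_by_count(history, retain_last=10_000)

# A reader that is three 2k-snapshot runs behind the newest snapshot
# still finds its checkpoint among the retained snapshots.
reader_checkpoint = history[-1] - 3 * 2_000
print(checkpoint_survives(history, 10_000, reader_checkpoint))
```

With a 10k retention window and 2k increments, the reader can fall up to five runs behind before its checkpoint is expired, and a long gap between writes never expires anything on its own.
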
That gives enough time to unclog the pipeline if the read job gets stuck.

> On Jan 12, 2021, at 9:44 AM, Ryan Blue <[email protected]> wrote:
>
> I agree with Russell. The right thing to do is to keep snapshots around long
> enough that you don't lose the history. I don't think that changing the read
> logic would work because you need all the history to ensure you aren't
> skipping a snapshot.
>
>> On Mon, Jan 11, 2021 at 11:24 AM Russell Spitzer <[email protected]>
>> wrote:
>> I would probably try to just extend my expiration interval if that was
>> possible to fix the issue since it's basically functioning as a watermark
>> for state at the moment.
>>
>> Is our underlying issue here that we cannot determine the lineage of a
>> Snapshot that has been expired? IE: We know all the files and which
>> snapshots added them, but we cannot determine where our "From" snapshot
>> exists in history since we did the expiration?
>>
>>> On Mon, Jan 11, 2021 at 11:07 AM Filip <[email protected]> wrote:
>>> Hi team,
>>>
>>> We've recently bumped into an issue with a particular edge case that messes
>>> with our implementation of leveraging the incremental read and the expire
>>> snapshot features combined.
>>>
>>> With incremental read we're relying on the client to preserve the snapshot
>>> that was last used for reading data as a checkpoint. Every time the client
>>> does an incremental read it gets new data (if available) along with the
>>> current snapshot that the client will store along as its new checkpoint.
>>>
>>> Expire snapshot is scheduled to kick in and wipe snapshots based on recency
>>> (say older than N days).
>>> But in the edge-case of two consecutive write operations happening less
>>> often than the expiration interval (*) if the incremental read process
>>> doesn't run before the snapshot expiration then the client will be left in
>>> an inconsistent state since the snapshot it has stored as checkpoint is not
>>> going to work anymore.
>>>
>>> So we were looking at either extending the snapshot expiration feature or
>>> extending the implementation of incremental read.
>>>
>>> I'll just drop-in some details on exploring the solution to extend
>>> incremental read - extend it by adding a fallback logic when the provided
>>> snapshot is missing and try to locate the snapshot parented by that
>>> particular snapshot instead.
>>> This would change the logic of the incremental read with respect to
>>> inclusiveness of loading the snapshots, if it currently considers the
>>> provided "from" snapshot as exclusive, in the case of the fallback logic to
>>> using the child snapshot as "from" it would have to be inclusive.
>>>
>>> Let me know if you think this edge-case should be supported by Iceberg and
>>> if this idea of extending the incremental read logic makes sense or if
>>> folks in the community have a better solution for this.
>>>
>>> (*) We expire snapshots older than 10 days but we observe two consecutive
>>> write operations 11 days apart.
>>>
>>> --
>>> Filip Bocse
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
