I do think this comes up a lot and is one of the more confusing things about snapshot expiration. One of the questions I answer most often is: "When I set min-snapshots to 1, why do I not get only 1 snapshot?" I agree adding another behavior may be even more confusing, but I wouldn't be opposed to having it be a parameter of the existing expire snapshots action, something like `expireAllBut(x)`. Setting the expiration time to 1 ms and setting a number of min-snapshots has always felt a bit hacky to me, but I've recommended it many times.
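To make the min-snapshots confusion concrete, here is a minimal sketch of the retention rule as it is described in this thread: the minimum only protects snapshots from age-based expiration, it does not cap the count. This is a toy model over a single linear history, not Iceberg's implementation; `Snapshot`, `retained_ids`, and the sample history are hypothetical names made up for illustration.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000
FIVE_DAYS_MS = 5 * 24 * HOUR_MS  # the default max age mentioned in the thread


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int


def retained_ids(snapshots, now_ms, max_age_ms, min_to_keep):
    """Toy model (assumption, not Iceberg code): a snapshot expires only if it
    is BOTH older than max_age_ms AND outside the newest min_to_keep."""
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)
    kept = []
    for rank, snap in enumerate(newest_first):
        protected_by_min = rank < min_to_keep
        young_enough = now_ms - snap.timestamp_ms <= max_age_ms
        if protected_by_min or young_enough:
            kept.append(snap.snapshot_id)
    return kept


# Five hourly snapshots with ids 1..5, evaluated one hour after the last commit.
history = [Snapshot(i, (i - 1) * HOUR_MS) for i in range(1, 6)]
now = 5 * HOUR_MS

# min=1 with a 5-day max age: nothing is old enough to expire, so all five stay.
all_kept = retained_ids(history, now, FIVE_DAYS_MS, min_to_keep=1)

# The workaround: a 1 ms max age makes every snapshot eligible for expiration,
# so only the protected minimum survives.
only_newest = retained_ids(history, now, 1, min_to_keep=1)
newest_two = retained_ids(history, now, 1, min_to_keep=2)
```

In this model `all_kept` holds all five ids even though the minimum is 1, which is the behavior people ask about. With the 1 ms age, `only_newest` and `newest_two` hold exactly the newest one and two ids, which is roughly what an "expire all but x" parameter would do.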
I am open to any change to this, because if any question comes up this many times, it is probably confusing.

On Tue, Jan 21, 2025 at 2:27 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I think you could achieve what you're looking for by setting the age to 1
> ms and the minimum number of snapshots to keep. Then snapshot expiration
> would always expire all snapshots other than the min number, getting you
> what you want.
>
> It probably wouldn't make sense to set a maximum as well. Right now, the
> min number of snapshots is a requirement that keeps snapshots around even
> if they are eligible to be removed because of expiration. A maximum would
> work differently and would be a second way to consider a snapshot eligible
> for expiration -- or else we would have to redefine how the min works. I
> think that would be a bit confusing to configure in practice because we'd
> need to define these cases for which configuration takes precedence. It
> seems much simpler to me to use the min snapshots setting with a very short
> expiration interval if you want to always keep some number of snapshots
> rather than using the age-based expiration.
>
> On Tue, Jan 21, 2025 at 9:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Manu,
>>
>> I think I understand what you're trying to achieve here and I feel like
>> the most important part is to have an updated version of the retention
>> procedure <https://iceberg.apache.org/spec/#snapshot-retention-policy> to
>> clearly state how this interacts with the other settings as part of the PR.
>>
>> -Dan
>>
>> On Thu, Jan 16, 2025 at 8:37 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> It makes sense to have an option to control the max number of snapshots.
>>> Thanks Manu for the proposal.
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Jan 16, 2025 at 7:46 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Do you have more comments on this feature? Do you have concerns about
>>>> adding a new field to SnapshotRef?
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Tue, Jan 7, 2025 at 2:37 PM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ajantha,
>>>>>
>>>>> `history.expire.min-snapshots-to-keep` is the *minimum number of
>>>>> snapshots* we can keep. I'm proposing to decide the *maximum number
>>>>> of snapshots* to keep by count rather than by age.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Manu,
>>>>>>
>>>>>> We already have `retain_last` and
>>>>>> `history.expire.min-snapshots-to-keep` to retain the snapshots based on
>>>>>> count. Can you please elaborate on why can't we use the same?
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Manu for starting this discussion. That is definitely a valid
>>>>>>> feature. I have always found maintaining snapshots by day makes it harder
>>>>>>> to provide different types of guarantees/contracts especially when tables
>>>>>>> change rates are diverse or irregular. Maintaining by snapshot count makes
>>>>>>> a lot of sense and prevents table sizes from growing excessively when
>>>>>>> change rate is frequent.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> While maintaining Iceberg tables for our customers, I find it's
>>>>>>>> difficult to set a default snapshot expiration time
>>>>>>>> (`history.expire.max-snapshot-age-ms`) for different workloads. The
>>>>>>>> default value of 5 days looks good for daily batch jobs but is too long
>>>>>>>> for frequently-updated jobs.
>>>>>>>>
>>>>>>>> I'm thinking about adding another option like
>>>>>>>> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
>>>>>>>> snapshot will be removed when either its age is larger than
>>>>>>>> `history.expire.max-snapshot-age-ms` or it's the oldest in
>>>>>>>> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a
>>>>>>>> draft PR to demo the idea[1].
>>>>>>>>
>>>>>>>> If you agree this is a valid feature request, we also need to
>>>>>>>> update SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will
>>>>>>>> there be a compatibility issue or too much cost to maintain
>>>>>>>> compatibility?
>>>>>>>> My experiment shows many parsers need to be updated.
>>>>>>>>
>>>>>>>> I'd like to hear your thoughts on this.
>>>>>>>>
>>>>>>>> 1. https://github.com/apache/iceberg/pull/11879
>>>>>>>> 2. https://iceberg.apache.org/spec/#snapshot-references
>>>>>>>>
>>>>>>>> Happy New Year!
>>>>>>>> Manu
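For comparison, the rule in the original proposal (remove a snapshot when it is older than `history.expire.max-snapshot-age-ms` or when it falls outside the newest `history.expire.max-snapshots-to-keep`) can be sketched as below. This is only an illustration of the rule as worded in the proposal, not the code in the draft PR. How the maximum would interact with `history.expire.min-snapshots-to-keep` was left open in the thread, so the sketch deliberately leaves the minimum out. `expired_under_proposal` and the sample workloads are hypothetical.

```python
MINUTE_MS = 60 * 1000
DAY_MS = 24 * 60 * MINUTE_MS
FIVE_DAYS_MS = 5 * DAY_MS  # default max age mentioned in the thread


def expired_under_proposal(timestamps_ms, now_ms, max_age_ms, max_to_keep):
    """Toy model (assumption): a snapshot expires if it is too old OR if it is
    not among the newest max_to_keep snapshots. The min setting is not modeled."""
    newest_first = sorted(timestamps_ms, reverse=True)
    expired = []
    for rank, ts in enumerate(newest_first):
        too_old = now_ms - ts > max_age_ms
        over_count = rank >= max_to_keep
        if too_old or over_count:
            expired.append(ts)
    return expired


# Frequently updated table: ten commits one minute apart. Age alone would keep
# all of them for five days; the count cap removes everything but the newest 3.
frequent = [i * MINUTE_MS for i in range(10)]
frequent_expired = expired_under_proposal(
    frequent, now_ms=10 * MINUTE_MS, max_age_ms=FIVE_DAYS_MS, max_to_keep=3
)

# Daily batch table: seven daily commits. The cap of 10 is never reached, so
# only the age rule applies and the two snapshots older than 5 days expire.
daily = [i * DAY_MS for i in range(7)]
daily_expired = expired_under_proposal(
    daily, now_ms=7 * DAY_MS, max_age_ms=FIVE_DAYS_MS, max_to_keep=10
)
```

Under these assumptions the frequent table expires seven of its ten snapshots and the daily table expires two of its seven. That matches the stated motivation: a single default age suits daily jobs but retains too much for frequently updated ones.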