I do think this comes up a lot, and it is one of the more confusing things
about snapshot expiration. One of the questions I answer most often is:
"When I set min-snapshots to 1, why do I not get only 1 snapshot?" I
agree that adding another behavior may be even more confusing, but I wouldn't
be opposed to making it a parameter of the existing expire snapshots
action, something like expireAllBut(x). Setting the expiration time to 1 ms
together with a number of min-snapshots has always felt a bit hacky to me, but
I've recommended it many times.

I am open to any change to this, because if any question comes up this many
times, it is probably confusing.
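
For anyone reading this later, here is a rough sketch of why min-snapshots = 1
does not leave only 1 snapshot, and why the 1 ms workaround does. It is a
simplified model, not Iceberg code: it assumes a single linear branch with no
tags or other refs, and the function name `retained_snapshots` and its
parameters are made up for illustration.

```python
from typing import List


def retained_snapshots(timestamps_ms: List[int], now_ms: int,
                       max_age_ms: int, min_to_keep: int) -> List[int]:
    """Simplified model of age-based expiration with a minimum count.

    A snapshot survives if it is among the `min_to_keep` newest snapshots
    OR it is not older than `max_age_ms`. The minimum only protects
    snapshots; it never makes one eligible for removal.
    """
    newest_first = sorted(timestamps_ms, reverse=True)
    kept = []
    for position, ts in enumerate(newest_first):
        within_min_count = position < min_to_keep
        young_enough = now_ms - ts <= max_age_ms
        if within_min_count or young_enough:
            kept.append(ts)
    return kept


DAY_MS = 24 * 60 * 60 * 1000
now = 10 * DAY_MS
# Six snapshots, one per day; the newest is one day old.
history = [now - days * DAY_MS for days in range(1, 7)]

# 5-day max age with min-snapshots = 1: every snapshot within 5 days stays.
default_age = retained_snapshots(history, now, max_age_ms=5 * DAY_MS, min_to_keep=1)

# The workaround: a 1 ms max age makes everything eligible, so only the
# minimum count is left protecting snapshots.
workaround = retained_snapshots(history, now, max_age_ms=1, min_to_keep=1)

print(len(default_age), len(workaround))
```

In this model the first call keeps five snapshots and the second keeps one,
which is the gap between what people expect from min-snapshots and what it
actually does.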

On Tue, Jan 21, 2025 at 2:27 PM rdb...@gmail.com <rdb...@gmail.com> wrote:

> I think you could achieve what you're looking for by setting the age to 1
> ms and the minimum number of snapshots to keep. Then snapshot expiration
> would always expire all snapshots other than the min number, getting you
> what you want.
>
> It probably wouldn't make sense to set a maximum as well. Right now, the
> min number of snapshots is a requirement that keeps snapshots around even
> if they are eligible to be removed because of expiration. A maximum would
> work differently and would be a second way to consider a snapshot eligible
> for expiration -- or else we would have to redefine how the min works. I
> think that would be a bit confusing to configure in practice because we'd
> need to define these cases for which configuration takes precedence. It
> seems much simpler to me to use the min snapshots setting with a very short
> expiration interval if you want to always keep some number of snapshots
> rather than using the age-based expiration.
>
> On Tue, Jan 21, 2025 at 9:51 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Manu,
>>
>> I think I understand what you're trying to achieve here and I feel like
>> the most important part is to have an updated version of the retention
>> procedure <https://iceberg.apache.org/spec/#snapshot-retention-policy> to
>> clearly state how this interacts with the other settings as part of the PR.
>>
>> -Dan
>>
>> On Thu, Jan 16, 2025 at 8:37 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>
>>> It makes sense to have an option to control the max number of snapshots.
>>> Thanks Manu for the proposal.
>>>
>>> Yufei
>>>
>>>
>>> On Thu, Jan 16, 2025 at 7:46 PM Manu Zhang <owenzhang1...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Do you have more comments on this feature? Do you have concerns about
>>>> adding a new field to SnapshotRef?
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>> On Tue, Jan 7, 2025 at 2:37 PM Manu Zhang <owenzhang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ajantha,
>>>>>
>>>>> `history.expire.min-snapshots-to-keep` is the *minimum number of
>>>>> snapshots* we can keep. I'm proposing to decide the *maximum number
>>>>> of snapshots* to keep by count rather than by age.
>>>>>
>>>>> Thanks,
>>>>> Manu
>>>>>
>>>>> On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat <ajanthab...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Manu,
>>>>>>
>>>>>> We already have `retain_last` and
>>>>>> `history.expire.min-snapshots-to-keep` to retain snapshots based on
>>>>>> count. Can you please elaborate on why we can't use the same?
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Manu for starting this discussion. That is definitely a valid
>>>>>>> feature. I have always found that maintaining snapshots by day makes it
>>>>>>> harder to provide different types of guarantees/contracts, especially
>>>>>>> when table change rates are diverse or irregular. Maintaining by
>>>>>>> snapshot count makes a lot of sense and prevents table sizes from
>>>>>>> growing excessively when changes are frequent.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Walaa.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> While maintaining Iceberg tables for our customers, I find it's
>>>>>>>> difficult to set a default snapshot expiration time
>>>>>>>> (`history.expire.max-snapshot-age-ms`) for different workloads. The
>>>>>>>> default value of 5 days looks good for daily batch jobs but is too
>>>>>>>> long for frequently-updated jobs.
>>>>>>>>
>>>>>>>> I'm thinking about adding another option like
>>>>>>>> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
>>>>>>>> snapshot will be removed when either its age is larger than
>>>>>>>> `history.expire.max-snapshot-age-ms` or it's the oldest in
>>>>>>>> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a
>>>>>>>> draft PR to demo the idea[1].
>>>>>>>>
>>>>>>>> If you agree this is a valid feature request, we also need to
>>>>>>>> update SnapshotRef[2], adding a new field `max-snapshots-to-keep`. Will
>>>>>>>> there be a compatibility issue or too much cost to maintain
>>>>>>>> compatibility? My experiment shows many parsers need to be updated.
>>>>>>>>
>>>>>>>> I'd like to hear your thoughts on this.
>>>>>>>>
>>>>>>>> 1. https://github.com/apache/iceberg/pull/11879
>>>>>>>> 2. https://iceberg.apache.org/spec/#snapshot-references
>>>>>>>>
>>>>>>>> Happy New Year!
>>>>>>>> Manu
>>>>>>>>
>>>>>>>
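
The removal rule in the original proposal at the bottom of this thread can be
sketched roughly as follows. This is only a model of the rule as worded there
(a snapshot goes away when it is too old or when it falls outside the newest N
snapshots); it is not the code from the draft PR. The function name
`retained_with_max` is hypothetical, and the sketch deliberately leaves out
`min-snapshots-to-keep`, since how the two settings would interact is exactly
the open question in this thread.

```python
from typing import List


def retained_with_max(timestamps_ms: List[int], now_ms: int,
                      max_age_ms: int, max_to_keep: int) -> List[int]:
    """Keep a snapshot only if it is young enough AND among the newest
    `max_to_keep` snapshots (assumes one linear branch, no other refs)."""
    newest_first = sorted(timestamps_ms, reverse=True)
    return [
        ts
        for position, ts in enumerate(newest_first)
        if position < max_to_keep and now_ms - ts <= max_age_ms
    ]


HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS
now = 10 * DAY_MS
# A frequently updated table: one snapshot per hour for three days.
hourly = [now - hours * HOUR_MS for hours in range(1, 73)]

# Age alone (5 days) keeps all 72 snapshots; a count cap of 10 bounds it.
age_only = retained_with_max(hourly, now, max_age_ms=5 * DAY_MS, max_to_keep=len(hourly))
capped = retained_with_max(hourly, now, max_age_ms=5 * DAY_MS, max_to_keep=10)

print(len(age_only), len(capped))
```

The hourly table is the frequently updated case from the proposal: the 5-day
default retains all of its history, while the count cap bounds it regardless
of commit rate.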
