Hey Manu,

I think I understand what you're trying to achieve here and I feel like the
most important part is to have an updated version of the retention procedure
<https://iceberg.apache.org/spec/#snapshot-retention-policy> to clearly
state how this interacts with the other settings as part of the PR.

-Dan

On Thu, Jan 16, 2025 at 8:37 PM Yufei Gu <flyrain...@gmail.com> wrote:

> It makes sense to have an option to control the max number of snapshots.
> Thanks Manu for the proposal.
>
> Yufei
>
>
> On Thu, Jan 16, 2025 at 7:46 PM Manu Zhang <owenzhang1...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> Do you have more comments on this feature? Do you have concerns about
>> adding a new field to SnapshotRef?
>>
>> Thanks,
>> Manu
>>
>> On Tue, Jan 7, 2025 at 2:37 PM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> Hi Ajantha,
>>>
>>> `history.expire.min-snapshots-to-keep` is the *minimum number of
>>> snapshots* we can keep. I'm proposing to decide the *maximum number of
>>> snapshots* to keep by count rather than by age.
>>>
>>> Thanks,
>>> Manu
>>>
>>> On Tue, Jan 7, 2025 at 2:18 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Manu,
>>>>
>>>> We already have `retain_last` and
>>>> `history.expire.min-snapshots-to-keep` to retain the snapshots based on
>>>> count. Can you please elaborate on why can't we use the same?
>>>>
>>>> - Ajantha
>>>>
>>>> On Tue, Jan 7, 2025 at 11:33 AM Walaa Eldin Moustafa <
>>>> wa.moust...@gmail.com> wrote:
>>>>
>>>>> Thanks Manu for starting this discussion. That is definitely a valid
>>>>> feature. I have always found maintaining snapshots by day makes it harder
>>>>> to provide different types of guarantees/contracts especially when tables
>>>>> change rates are diverse or irregular. Maintaining by snapshot count makes
>>>>> a lot of sense and prevents table sizes from growing excessively when
>>>>> change rate is frequent.
>>>>>
>>>>> Thanks,
>>>>> Walaa.
>>>>>
>>>>>
>>>>> On Mon, Jan 6, 2025 at 8:38 PM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> While maintaining Iceberg tables for our customers, I find it's
>>>>>> difficult to set a default snapshot expiration time
>>>>>> (`history.expire.max-snapshot-age-ms`) for different workloads. The 
>>>>>> default
>>>>>> value of 5 days looks good for daily batch jobs but is too long for
>>>>>> frequently-updated jobs.
>>>>>>
>>>>>> I'm thinking about adding another option like
>>>>>> `history.expire.max-snapshots-to-keep` to keep at most N snapshots. A
>>>>>> snapshot will be removed when either its age is larger than
>>>>>> `history.expire.max-snapshot-age-ms` or it's the oldest in
>>>>>> `history.expire.max-snapshots-to-keep + 1` snapshots. I've created a 
>>>>>> draft
>>>>>> PR to demo the idea[1].
>>>>>>
>>>>>> If you agree this is a valid feature request, we also need to update
>>>>>> SnapshotRef[2] adding a new field `max-snapshots-to-keep`. Will there be 
>>>>>> a
>>>>>> compatibility issue or too much cost to maintain compatibility? My
>>>>>> experiment shows many parsers need to be updated.
>>>>>>
>>>>>> I'd like to hear your thoughts on this.
>>>>>>
>>>>>> 1. https://github.com/apache/iceberg/pull/11879
>>>>>> 2. https://iceberg.apache.org/spec/#snapshot-references
>>>>>>
>>>>>> Happy New Year!
>>>>>> Manu
>>>>>>
>>>>>

Reply via email to