Thanks for the responses so far!

Sure, keeping the default as false makes sense: this is a new feature, so
let's stay on the safe side.

About exposing the flag in the Spark action/procedure and also via Flink:
I believe there are currently a number of vendors whose catalogs are not
capable of performing table maintenance. We, for instance, advise our
users to use Spark procedures for table maintenance. Hence, it would come
in quite handy for us to also be able to control the functionality behind
the 'cleanExpiredMetadata' flag through the expire_snapshots procedure.
Since the functionality is already there in the Java ExpireSnapshots API
(rough sketch below), this seems like a low-effort exercise.
I'd like to avoid telling users to call the Java API directly, but if
extending the procedure is not an option and the catalog implementation in
use doesn't support this either, I don't see what other possibilities we
have here.
Taking these into consideration, would it be possible to extend Spark and
Flink with support for setting this flag?
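For reference, today the only thing we could point users at is the Java
API, which already exposes the flag. A rough sketch (the table and
timestamp names are just placeholders):

    // table is an org.apache.iceberg.Table loaded from the catalog
    table.expireSnapshots()
        .expireOlderThan(olderThanMillis)   // placeholder retention cutoff
        .cleanExpiredMetadata(true)         // the flag this thread is about
        .commit();

What I'm asking for on the Spark side would look roughly like this on the
action (hypothetical sketch, the cleanExpiredMetadata option does not
exist on ExpireSnapshotsSparkAction today):

    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(olderThanMillis)
        .cleanExpiredMetadata(true)         // proposed option, not available yet
        .execute();

plus the equivalent named argument on the expire_snapshots procedure.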

Thanks,
Gabor

On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:

> I don't think it is necessary to either make cleanup the default or to
> expose the flag in Spark or other engines.
>
> Right now, catalogs are taking on a lot more responsibility for things
> like snapshot expiration, orphan file cleanup, and schema or partition spec
> removal. Ideally, those are tasks that catalogs handle rather than having
> clients run them, but right now we have systems for keeping tables clean
> (i.e. expiring snapshots) that are built using clients rather than being
> controlled through catalogs. That's not a problem and we want to continue
> to support them, but I also don't think that we should make the problem
> worse. I think we should consider schema and partition spec cleanup to be
> catalog service tasks, so we should not spend much effort to make them
> easily available to users. And we should not make them the default behavior
> because we don't want clients removing these manually and duplicating work
> on the client and in REST services.
>
> Ryan
>
> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi Gabor
>>
>> I think the question is "when". As it's a behavior change, I don't
>> think we should do that on a "minor" release, else users would be
>> "surprised".
>> I would propose to keep the current behavior on Iceberg Java 1.x and
>> change the flag to true on Iceberg Java 2.x (after a vote).
>>
>> Regards
>> JB
>>
>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org>
>> wrote:
>> >
>> > Hi Iceberg Community,
>> >
>> > There were recent additions to RemoveSnapshots to expire the unused
>> partition specs and schemas. This is controlled by a flag called
>> 'cleanExpiredMetadata' and has a default value 'false'. Additionally, Spark
>> and Flink don't offer a way to set this flag currently.
>> >
>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>> > I'm wondering if it's desired by the community to default this flag to
>> true. The effect of that would be that each snapshot expiration would also
>> clean up the unused partition specs and schemas too. This functionality is
>> quite new, so it might need some extra confidence from the community before
>> being turned on by default, but I think it's worth considering.
>> >
>> > 2) Spark and Flink to support setting this flag
>> > I think it makes sense to add support in Spark's
>> ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, as well as Flink's
>> ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting this flag
>> based on (user) input.
>> >
>> > WDYT?
>> >
>> > Regards,
>> > Gabor
>>
>
