I think a catalog service can also use Spark/Flink procedures for table
maintenance, to utilize existing systems and cluster resources.

If we no longer support new functionality in Spark/Flink procedures, we are
effectively deprecating them, right?

On Thu, Mar 20, 2025 at 00:07, Gabor Kaszab <gaborkas...@apache.org> wrote:

> Thanks for the responses so far!
>
> Sure, keeping the default as false makes sense since this is a new
> feature; let's be on the safe side.
>
> About exposing this flag in the Spark action/procedure and also via
> Flink:
> I believe there are currently a number of vendors that don't have a
> catalog capable of performing table maintenance. We, for instance,
> advise our users to use Spark procedures for table maintenance. Hence, it
> would come in quite handy for us to also have a way to control the
> functionality behind the 'cleanExpiredMetadata' flag through the
> expire_snapshots procedure. Since the functionality is already there in the
> Java ExpireSnapshots API, this seems like a low-effort exercise.
> I'd like to avoid telling users to call the Java API directly, but if
> extending the procedure is not an option, and the catalog implementation
> in use doesn't support this either, I don't see what other options we
> have here.
> Taking these into consideration, would it be possible to extend the
> Spark and Flink integrations to support setting this flag?
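>
> For illustration, here is roughly what the existing Java API usage looks
> like (a minimal sketch; the catalog variable and table name are
> placeholders):
>
>   import java.util.concurrent.TimeUnit;
>   import org.apache.iceberg.Table;
>   import org.apache.iceberg.catalog.TableIdentifier;
>
>   // Load the table through whatever catalog implementation is in use.
>   Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
>
>   // Expire snapshots older than a week, and use the recently added
>   // cleanExpiredMetadata flag to also drop unused specs and schemas.
>   table.expireSnapshots()
>       .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
>       .cleanExpiredMetadata(true)
>       .commit();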
>
> Thanks,
> Gabor
>
> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> I don't think it is necessary to either make cleanup the default or to
>> expose the flag in Spark or other engines.
>>
>> Right now, catalogs are taking on a lot more responsibility for things
>> like snapshot expiration, orphan file cleanup, and schema or partition spec
>> removal. Ideally, those are tasks that catalogs handle rather than having
>> clients run them, but right now we have systems for keeping tables clean
>> (e.g., expiring snapshots) that are built using clients rather than being
>> controlled through catalogs. That's not a problem and we want to continue
>> to support them, but I also don't think that we should make the problem
>> worse. I think we should consider schema and partition spec cleanup to be
>> catalog service tasks, so we should not spend much effort to make them
>> easily available to users. And we should not make them the default behavior
>> because we don't want clients removing these manually and duplicating work
>> on the client and in REST services.
>>
>> Ryan
>>
>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Hi Gabor
>>>
>>> I think the question is "when". As this is a behavior change, I don't
>>> think we should do it in a "minor" release, or users would be
>>> "surprised".
>>> I would propose keeping the current behavior in Iceberg Java 1.x and
>>> changing the flag to true in Iceberg Java 2.x (after a vote).
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org>
>>> wrote:
>>> >
>>> > Hi Iceberg Community,
>>> >
>>> > There were recent additions to RemoveSnapshots to expire unused
>>> > partition specs and schemas. This is controlled by a flag called
>>> > 'cleanExpiredMetadata', which defaults to 'false'. Additionally, Spark
>>> > and Flink currently don't offer a way to set this flag.
>>> >
>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>> > I'm wondering whether the community wants to default this flag to
>>> > true. The effect would be that each snapshot expiration also cleans up
>>> > the unused partition specs and schemas. This functionality is quite
>>> > new, so the community might need some extra confidence before turning
>>> > it on by default, but I think it's worth considering.
>>> >
>>> > 2) Spark and Flink to support setting this flag
>>> > I think it makes sense to add support in Spark's
>>> > ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also in
>>> > Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting
>>> > this flag based on user input.
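>>> >
>>> > To sketch what this could look like from Spark once supported (the
>>> > 'clean_expired_metadata' argument name below is hypothetical; the
>>> > procedure doesn't accept it today, and 'spark' and the catalog and
>>> > table names are placeholders):
>>> >
>>> >   // Hypothetical future call to the expire_snapshots procedure:
>>> >   spark.sql(
>>> >       "CALL my_catalog.system.expire_snapshots(" +
>>> >           "table => 'db.tbl', " +
>>> >           "older_than => TIMESTAMP '2025-03-01 00:00:00', " +
>>> >           "clean_expired_metadata => true)");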
>>> >
>>> > WDYT?
>>> >
>>> > Regards,
>>> > Gabor
>>>
>>
