Thanks all for the discussion. I also agree that this behavior should be
turned off by default, and I would love to see the flag added to the
Spark/Flink procedures as well. Making the feature available on the client
side seems more achievable in the short run, while designing a server-side
solution would likely take more time (e.g. a spec change, vendor
implementations, etc.).
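
For reference, this is roughly what the client-side path already looks like
through the core Java API (a minimal sketch; the catalog wiring and the cutoff
timestamp are placeholders, and it assumes an Iceberg version whose
ExpireSnapshots interface already exposes cleanExpiredMetadata):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class ExpireWithMetadataCleanup {
      // Expire old snapshots and also drop partition specs/schemas that are
      // no longer referenced by the remaining metadata.
      public static void expire(Catalog catalog, TableIdentifier ident,
                                long olderThanMillis) {
        Table table = catalog.loadTable(ident);
        table.expireSnapshots()
            .expireOlderThan(olderThanMillis) // usual snapshot-age cutoff
            .cleanExpiredMetadata(true)       // the flag discussed in this thread
            .commit();
      }
    }

If we extend the Spark procedure, I'd expect the call to end up looking
something like CALL catalog.system.expire_snapshots(table => 'db.tbl',
clean_expired_metadata => true), though the exact parameter name here is just
a guess at this point.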

On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Thanks for the responses!
>
> My concern is the same, Manu, Peter: many stakeholders in this community
> don't have a catalog that is capable of executing table maintenance (e.g.
> HiveCatalog) and rely on the Spark procedures and actions for this purpose.
> I feel that we should still give them the new functionality to clean
> expired metadata (specs, schemas) by extending the Spark and Flink
> procedures.
>
> Regards,
> Gabor
>
> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> I know of several companies who are using either scheduled stored
>> procedures or the existing actions to maintain production tables.
>> I don't think we should deprecate them until there is a viable open
>> solution for them.
>>
>> On Wed, Mar 19, 2025 at 5:52 PM Manu Zhang <owenzhang1...@gmail.com>
>> wrote:
>>
>>> I think a catalog service can also use Spark/Flink procedures for table
>>> maintenance, to utilize existing systems and cluster resources.
>>>
>>> If we no longer support new functionality in Spark/Flink procedures, we
>>> are effectively deprecating them, right?
>>>
>>> On Thu, Mar 20, 2025 at 12:07 AM Gabor Kaszab <gaborkas...@apache.org>
>>> wrote:
>>>
>>>> Thanks for the responses so far!
>>>>
>>>> Sure, keeping the default as false makes sense because this is a new
>>>> feature, so let's be on the safe side.
>>>>
>>>> About exposing the flag in the Spark action/procedure and also via
>>>> Flink:
>>>> I believe there are currently a number of vendors that don't have a
>>>> catalog capable of performing table maintenance. We, for instance,
>>>> advise our users to use Spark procedures for table maintenance. Hence, it
>>>> would come in quite handy for us to also have a way to control the
>>>> functionality behind the 'cleanExpiredMetadata' flag through the
>>>> expire_snapshots procedure. Since the functionality is already there in the
>>>> Java ExpireSnapshots API, this seems like a low-effort exercise.
>>>> I'd like to avoid telling users to call the Java API directly, but if
>>>> extending the procedure is not an option and the catalog implementation
>>>> in use doesn't support this either, I don't see what other possibilities
>>>> we have here.
>>>> Taking this into consideration, would it be possible to extend the
>>>> Spark and Flink procedures to support setting this flag?
>>>>
>>>> Thanks,
>>>> Gabor
>>>>
>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>
>>>>> I don't think it is necessary either to make cleanup the default or to
>>>>> expose the flag in Spark or other engines.
>>>>>
>>>>> Right now, catalogs are taking on a lot more responsibility for things
>>>>> like snapshot expiration, orphan file cleanup, and schema or partition spec
>>>>> removal. Ideally, those are tasks that catalogs handle rather than having
>>>>> clients run them, but right now we have systems for keeping tables clean
>>>>> (i.e. expiring snapshots) that are built using clients rather than being
>>>>> controlled through catalogs. That's not a problem and we want to continue
>>>>> to support them, but I also don't think that we should make the problem
>>>>> worse. I think we should consider schema and partition spec cleanup to be
>>>>> catalog service tasks, so we should not spend much effort to make them
>>>>> easily available to users. And we should not make them the default behavior
>>>>> because we don't want clients removing these manually and duplicating work
>>>>> on the client and in REST services.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>>> Hi Gabor
>>>>>>
>>>>>> I think the question is "when". As it's a behavior change, I don't
>>>>>> think we should do that in a "minor" release, otherwise users would be
>>>>>> "surprised".
>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x and
>>>>>> change the flag to true on Iceberg Java 2.x (after a vote).
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Iceberg Community,
>>>>>> >
>>>>>> > There were recent additions to RemoveSnapshots to expire unused
>>>>>> partition specs and schemas. This is controlled by a flag called
>>>>>> 'cleanExpiredMetadata', which has a default value of 'false'.
>>>>>> Additionally, Spark and Flink don't currently offer a way to set this
>>>>>> flag.
>>>>>> >
>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>> > I'm wondering whether the community would like to default this flag
>>>>>> to true. The effect would be that each snapshot expiration also cleans
>>>>>> up the unused partition specs and schemas. This functionality is quite
>>>>>> new, so the community might need some extra confidence in it before it
>>>>>> is turned on by default, but I think it's worth considering.
>>>>>> >
>>>>>> > 2) Spark and Flink to support setting this flag
>>>>>> > I think it makes sense to add support in Spark's
>>>>>> ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, as well as in
>>>>>> Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow setting
>>>>>> this flag based on (user) input.
>>>>>> >
>>>>>> > WDYT?
>>>>>> >
>>>>>> > Regards,
>>>>>> > Gabor
>>>>>>
>>>>>
