Thanks all for the discussion. I also agree that we should keep this behavior turned off by default, and I would also love to see this flag added to the Spark/Flink procedures. Having this feature available on the client side seems more achievable in the short run, while designing a server-side solution would likely take more time (e.g. a spec change, vendor implementations, etc.).
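For context, the client-side path is already usable today through the Java API. A minimal sketch, assuming a loaded Catalog and placeholder namespace/table names ("db", "tbl"), looks roughly like this:

    import java.util.concurrent.TimeUnit;

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class ExpireWithMetadataCleanup {
      // "db" / "tbl" are placeholder names; any Catalog implementation works here
      // because both the snapshot expiration and the metadata cleanup run on the client.
      static void expire(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
        table.expireSnapshots()
            .expireOlderThan(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7))
            .cleanExpiredMetadata(true) // the flag discussed in this thread; defaults to false
            .commit();
      }
    }

Exposing the same option from the expire_snapshots procedure and the Flink maintenance entry points would essentially mean forwarding one extra boolean into ExpireSnapshotsSparkAction / ExpireSnapshots, which is why the client-side route looks like the smaller change compared to a server-side design.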
On Wed, Mar 26, 2025 at 8:17 AM Gabor Kaszab <gaborkas...@apache.org> wrote:

> Thanks for the responses!
>
> My concern is the same, Manu, Peter: many stakeholders in this community don't have a catalog that is capable of executing table maintenance (e.g. HiveCatalog) and rely on the Spark procedures and actions for this purpose. I feel that we should still give them the new functionality to clean expired metadata (specs, schemas) by extending the Spark and Flink procedures.
>
> Regards,
> Gabor
>
> On Wed, Mar 26, 2025 at 2:59 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>
>> I know of several companies who are using either scheduled stored procedures or the existing actions to maintain production tables.
>> I don't think we should deprecate them until there is a viable open solution for them.
>>
>> On Wed, Mar 19, 2025 at 17:52 Manu Zhang <owenzhang1...@gmail.com> wrote:
>>
>>> I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
>>>
>>> If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?
>>>
>>> On Thu, Mar 20, 2025 at 00:07 Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>
>>>> Thanks for the responses so far!
>>>>
>>>> Sure, keeping the default as false makes sense because this is a new feature, so let's be on the safe side.
>>>>
>>>> About exposing setting the flag in the Spark action/procedure and also via Flink:
>>>> I believe currently there are a number of vendors that don't have a catalog that is capable of performing table maintenance. We for instance advise our users to use spark procedures for table maintenance. Hence, it would come quite handy for us to also have a way to control the functionality behind the 'cleanExpiredMetadata' flag through the expire_snapshots procedure. Since the functionality is already there in the Java ExpireSnapshots API, this seems a low effort exercise.
>>>> I'd like to avoid telling the users to call the Java API directly, but if extending the procedure is not an option, and also the used catalog implementation doesn't give support for this, I don't see what other possibilities we have here.
>>>> Taking these into consideration, would it be possible to allow extending the Spark and Flink with the support of setting this flag?
>>>>
>>>> Thanks,
>>>> Gabor
>>>>
>>>> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>>>>
>>>>> I don't think it is necessary to either make cleanup the default or to expose the flag in Spark or other engines.
>>>>>
>>>>> Right now, catalogs are taking on a lot more responsibility for things like snapshot expiration, orphan file cleanup, and schema or partition spec removal. Ideally, those are tasks that catalogs handle rather than having clients run them, but right now we have systems for keeping tables clean (i.e. expiring snapshots) that are built using clients rather than being controlled through catalogs. That's not a problem and we want to continue to support them, but I also don't think that we should make the problem worse. I think we should consider schema and partition spec cleanup to be catalog service tasks, so we should not spend much effort to make them easily available to users. And we should not make them the default behavior because we don't want clients removing these manually and duplicating work on the client and in REST services.
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>>
>>>>>> Hi Gabor
>>>>>>
>>>>>> I think the question is "when". As it's a behavior change, I don't think we should do that on a "minor" release, else users would be "surprised".
>>>>>> I would propose to keep the current behavior on Iceberg Java 1.x and change the flag to true on Iceberg Java 2.x (after a vote).
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>>>>> >
>>>>>> > Hi Iceberg Community,
>>>>>> >
>>>>>> > There were recent additions to RemoveSnapshots to expire the unused partition specs and schemas. This is controlled by a flag called 'cleanExpiredMetadata' and has a default value 'false'. Additionally, Spark and Flink don't offer a way to set this flag currently.
>>>>>> >
>>>>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>>>>> > I'm wondering if it's desired by the community to default this flag to true. The effect of that would be that each snapshot expiration would also clean up the unused partition specs and schemas too. This functionality is quite new so this might need some extra confidence by the community before turning on by default but I think it's worth a consideration.
>>>>>> >
>>>>>> > 2) Spark and Flink to support setting this flag
>>>>>> > I think it makes sense to add support in Spark's ExpireSnapshotProcedure and ExpireSnapshotsSparkAction also to Flink's ExpireSnapshotsProcessor and ExpireSnapshots to allow setting this flag based on (user) inputs.
>>>>>> >
>>>>>> > WDYT?
>>>>>> >
>>>>>> > Regards,
>>>>>> > Gabor
>>>>>>
>>>>>