I think a catalog service can also use Spark/Flink procedures for table maintenance, to utilize existing systems and cluster resources.
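For example, a service could drive snapshot expiration through the existing Spark action rather than reimplementing maintenance itself. A rough sketch, assuming a live SparkSession, an already-loaded Table, and a seven-day cutoff picked purely for illustration:

import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class MaintenanceSketch {
  // Expires snapshots older than seven days via the existing Spark action.
  // The retention window here is illustrative, not a recommendation.
  static void expireSnapshots(SparkSession spark, Table table) {
    long cutoffMillis = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(cutoffMillis)
        .execute();
  }
}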
If we no longer support new functionality in Spark/Flink procedures, we are effectively deprecating them, right?

On Thu, Mar 20, 2025 at 00:07, Gabor Kaszab <gaborkas...@apache.org> wrote:
> Thanks for the responses so far!
>
> Sure, keeping the default as false makes sense because this is a new
> feature, so let's be on the safe side.
>
> About exposing setting the flag in the Spark action/procedure and also via
> Flink:
> I believe there are currently a number of vendors that don't have a
> catalog capable of performing table maintenance. We, for instance, advise
> our users to use Spark procedures for table maintenance. Hence, it would
> come in quite handy for us to also have a way to control the functionality
> behind the 'cleanExpiredMetadata' flag through the expire_snapshots
> procedure. Since the functionality is already there in the Java
> ExpireSnapshots API, this seems a low-effort exercise.
> I'd like to avoid telling users to call the Java API directly, but if
> extending the procedure is not an option and the catalog implementation in
> use doesn't support this either, I don't see what other possibilities we
> have here.
> Taking these into consideration, would it be possible to extend Spark and
> Flink with support for setting this flag?
>
> Thanks,
> Gabor
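For reference, "calling the Java API directly" looks roughly like this today. A minimal sketch; the catalog handle, table identifier, and cutoff below are placeholders:

import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

class ExpireDirectly {
  // cleanExpiredMetadata(true) also removes unused partition specs and
  // schemas while snapshots are expired; 'db.tbl' is a placeholder name.
  static void expire(Catalog catalog, long cutoffMillis) {
    Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
    table.expireSnapshots()
        .expireOlderThan(cutoffMillis)
        .cleanExpiredMetadata(true)
        .commit();
  }
}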
> On Fri, Mar 14, 2025 at 6:37 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> I don't think it is necessary to either make cleanup the default or to
>> expose the flag in Spark or other engines.
>>
>> Right now, catalogs are taking on a lot more responsibility for things
>> like snapshot expiration, orphan file cleanup, and schema or partition spec
>> removal. Ideally, those are tasks that catalogs handle rather than having
>> clients run them, but right now we have systems for keeping tables clean
>> (i.e. expiring snapshots) that are built using clients rather than being
>> controlled through catalogs. That's not a problem and we want to continue
>> to support them, but I also don't think that we should make the problem
>> worse. I think we should consider schema and partition spec cleanup to be
>> catalog service tasks, so we should not spend much effort to make them
>> easily available to users. And we should not make them the default
>> behavior, because we don't want clients removing these manually and
>> duplicating work on the client and in REST services.
>>
>> Ryan
>>
>> On Fri, Mar 14, 2025 at 8:16 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>>> Hi Gabor
>>>
>>> I think the question is "when". As it's a behavior change, I don't
>>> think we should do it in a "minor" release, or users would be
>>> "surprised".
>>> I would propose keeping the current behavior in Iceberg Java 1.x and
>>> changing the flag to true in Iceberg Java 2.x (after a vote).
>>>
>>> Regards
>>> JB
>>>
>>> On Fri, Mar 14, 2025 at 12:18 PM Gabor Kaszab <gaborkas...@apache.org> wrote:
>>> >
>>> > Hi Iceberg Community,
>>> >
>>> > There were recent additions to RemoveSnapshots to expire unused
>>> > partition specs and schemas. This is controlled by a flag called
>>> > 'cleanExpiredMetadata', which defaults to 'false'. Additionally,
>>> > Spark and Flink currently don't offer a way to set this flag.
>>> >
>>> > 1) Default value of RemoveSnapshots.cleanExpiredMetadata
>>> > I'm wondering whether the community wants to default this flag to
>>> > true. The effect would be that each snapshot expiration also cleans
>>> > up the unused partition specs and schemas. This functionality is
>>> > quite new, so it might need some extra confidence from the community
>>> > before it's turned on by default, but I think it's worth considering.
>>> >
>>> > 2) Spark and Flink support for setting this flag
>>> > I think it makes sense to add support in Spark's
>>> > ExpireSnapshotsProcedure and ExpireSnapshotsSparkAction, and also in
>>> > Flink's ExpireSnapshotsProcessor and ExpireSnapshots, to allow
>>> > setting this flag based on user input.
>>> >
>>> > WDYT?
>>> >
>>> > Regards,
>>> > Gabor
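To make 2) concrete, the change on the action side could be a simple pass-through of the flag. Hypothetical sketch only; neither this builder method on the Spark action nor a matching procedure argument exists yet:

// Hypothetical: forwards the existing core flag through the Spark action.
SparkActions.get(spark)
    .expireSnapshots(table)
    .expireOlderThan(cutoffMillis)
    .cleanExpiredMetadata(true) // proposed pass-through to the core API
    .execute();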