looking at "RewriteDataFilesSparkAction" from your PR #11513, I am fine that the RewriteExecutionContext is captured in the `Plan` object. My earlier point is that we need to pass those common metadata/context to the executor. We don't have to define a separate `PlanInfo` for that purpose if they are captured in the `Plan` already.
My other point is that the context/plan info may not need to be duplicated. E.g., currently the "RewriteExecutionContext" looks duplicated in the "RewriteDataFilesSparkAction" and "RewritePositionDeleteFilesSparkAction" classes.

On Fri, Feb 7, 2025 at 4:56 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

> Hi Steven,
>
> Thanks for checking this out!
>
> I created commits which contain only the API changes:
>
> 1. Everything is stored on the same level as in the current API:
> https://github.com/apache/iceberg/commit/8d612e074dcb8ee6d5ae354e329d0b78e3138c86
> 2. Simplified the API to push everything down to the file group level:
> https://github.com/apache/iceberg/commit/26e4704bba08a05d629c61d7eb785bb4e66ce8df
>
> This is the diff between the 2 options, if you would like to see it:
> https://github.com/apache/iceberg/commit/967b0e5a9b52ee0e303578fccf1d1d538e4689d9
>
> I prefer the 2nd option, which has fewer classes, and I consider pushing
> the outputSpecId down to the RewriteFileGroup a fair trade for it.
> I would even prefer to merge RewritePositionDeleteFiles#FileGroupInfo
> and RewriteFileGroup#FileGroupInfo if the community would not find that
> too invasive.
>
> If we agree on the general approach, I could create a PR for the API
> refactor first, and we can move forward from there.
> What do you think?
>
> Thanks,
> Peter
>
> On Thu, Feb 6, 2025 at 5:50 PM Steven Wu <stevenz...@gmail.com> wrote:
>
>> To be honest, it is hard to visualize the interface/structure discussion
>> in this format. More details (in a doc or PR) would be helpful.
>>
>> Regarding "data organization", I feel we can probably define one common
>> metadata class for all action types. We don't need both
>> RewriteFilePlanContext and RewritePositionDeletePlanContext. It is kind
>> of like the "TableScanContext" in the core module. It could just be a
>> superset of all the needed fields.
>>
>> On Wed, Feb 5, 2025 at 2:31 AM Péter Váry <peter.vary.apa...@gmail.com>
>> wrote:
>>
>>> Hi Steven,
>>>
>>> Thanks for chiming in!
>>>
>>> The decision points I have collected:
>>>
>>> - Data organization
>>>   1. Plan + Group
>>>   2. Group only
>>> - Parameter handling
>>>   1. All strings
>>>   2. Typed params
>>> - Engine-specific parameters for the executor
>>>   1. Common set calculated by the planner
>>>   2. Engine gets all user params and recalculates what is needed
>>>
>>> If I understand correctly, your answer to the "data organization"
>>> question is to use multiple objects (like PlanContext and Group).
>>> Do you have a strong opinion about the other questions?
>>>
>>> I am a bit concerned about the number of different classes for this
>>> API. Currently (before this change) we have
>>> RewriteFileGroup/RewritePositionDeletesGroup. If we introduce
>>> RewriteFilePlan/RewritePositionDeletePlan and also
>>> RewriteFilePlanContext/RewritePositionDeletePlanContext, we will have
>>> many interfaces/classes with similar implementations. This will make
>>> the code hard to understand. If we push all of the differences down to
>>> the group level, then we might be better off pushing the
>>> writeMaxFileSize to the group level as well. This way we can get rid
>>> of the PlanContext altogether and only have a single generic Plan
>>> object.
>>>
>>> Thanks,
>>> Peter
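A minimal sketch of the single shared metadata class discussed above, modeled loosely on TableScanContext in the core module; the class and field names are illustrative assumptions, not the proposed API:

    // Hypothetical superset context: one class carries the union of the
    // fields needed by the data file and position delete rewrite plans.
    class RewritePlanContext {
      private final long writeMaxFileSize;  // shared by all rewrite types
      private final int totalGroupCount;    // plan-level statistic
      private final int outputSpecId;       // only meaningful for data rewrites

      RewritePlanContext(long writeMaxFileSize, int totalGroupCount, int outputSpecId) {
        this.writeMaxFileSize = writeMaxFileSize;
        this.totalGroupCount = totalGroupCount;
        this.outputSpecId = outputSpecId;
      }

      long writeMaxFileSize() { return writeMaxFileSize; }
      int totalGroupCount() { return totalGroupCount; }
      int outputSpecId() { return outputSpecId; }
    }

With one superset class like this, fields that only apply to one action type (such as outputSpecId here) would simply be ignored by the other actions, which trades a little conceptual purity for far fewer parallel classes.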
>>> On Wed, Feb 5, 2025 at 12:18 AM Steven Wu <stevenz...@gmail.com>
>>> wrote:
>>>
>>>> At a high level, it makes sense to separate out the planning and
>>>> execution to promote reusing the planning code across engines.
>>>>
>>>> Just to add a 4th class to Russell's list:
>>>> 1) RewriteGroup: A container that holds all the files that are meant
>>>> to be compacted, along with information about them
>>>> 2) Rewriter: An engine-specific class which knows how to take a
>>>> RewriteGroup and generate new files; I think this should be
>>>> independent of the planner below
>>>> 3) Planner: A non-engine-specific class which knows how to generate
>>>> RewriteGroups given a set of parameters
>>>> 4) RewritePlan: A non-engine-specific class for the planner output.
>>>> It contains a list of RewriteGroups and a common metadata class
>>>> `PlanContext`.
>>>>
>>>> The Rewriter can take in a `RewriteGroup` (a list of files to be
>>>> compacted) and a `PlanContext` (common metadata).
>>>>
>>>> On Sat, Feb 1, 2025 at 3:06 AM Russell Spitzer
>>>> <russell.spit...@gmail.com> wrote:
>>>>
>>>>> We probably still have to support it as long as we have V2 table
>>>>> support, right?
>>>>>
>>>>> On Fri, Jan 31, 2025 at 9:13 AM Péter Váry
>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> We could simplify the API a bit if we omit DeleteFileRewrite.
>>>>>> Since Anton's work on the Puffin delete vectors, this will become
>>>>>> obsolete anyway, and focusing on data file rewriting would allow
>>>>>> us to remove some generics from the API.
>>>>>>
>>>>>> WDYT?
>>>>>>
>>>>>> On Tue, Jan 21, 2025 at 5:11 PM Russell Spitzer
>>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> To bump this back up: I think this is a pretty important change
>>>>>>> to the core library, so it's necessary that we get more folks
>>>>>>> involved in this discussion.
>>>>>>>
>>>>>>> I agree that the Rewrite Data Files code needs to be broken up
>>>>>>> and realigned if we want to be able to reuse it in Flink.
>>>>>>>
>>>>>>> I think I prefer that we essentially have three classes:
>>>>>>> 1) RewriteGroup: A container that holds all the files that are
>>>>>>> meant to be compacted, along with information about them
>>>>>>> 2) Rewriter: An engine-specific class which knows how to take a
>>>>>>> RewriteGroup and generate new files; I think this should be
>>>>>>> independent of the planner below
>>>>>>> 3) Planner: A non-engine-specific class which knows how to
>>>>>>> generate RewriteGroups given a set of parameters
>>>>>>>
>>>>>>> On Tue, Jan 14, 2025 at 7:08 AM Péter Váry
>>>>>>> <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Team,
>>>>>>>>
>>>>>>>> There is ongoing work to bring Flink Table Maintenance to
>>>>>>>> Iceberg [1]. We have already merged the main infrastructure and
>>>>>>>> are currently working on implementing the data file rewrite [2].
>>>>>>>> During the implementation we found that part of the compaction
>>>>>>>> planning implemented for the Spark compaction could, and should,
>>>>>>>> be reused in Flink as well. I created a PR [3] to bring those
>>>>>>>> changes to core Iceberg.
>>>>>>>>
>>>>>>>> The main changes in the API (a rough interface sketch follows
>>>>>>>> this list):
>>>>>>>>
>>>>>>>> - We need to separate the compaction planning from the rewrite
>>>>>>>> execution
>>>>>>>>   - The planning would collect the files to be compacted and
>>>>>>>>   organize them into compaction tasks/groups. This could be
>>>>>>>>   reused (in the same way as the query planning)
>>>>>>>>   - The rewrite would actually execute the rewrite. This needs
>>>>>>>>   to contain engine-specific code, so we need separate
>>>>>>>>   implementations for the separate engines
>>>>>>>> - We need to decide on the new compaction planning API
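A rough sketch of that planning/execution split, matching the planner/rewriter classes listed earlier in the thread; the interface names and signatures are illustrative assumptions, not the PR's actual API:

    // Hypothetical split: an engine-agnostic planner produces the groups,
    // and an engine-specific rewriter consumes one group at a time.
    import java.util.List;

    interface RewritePlanner<G> {
      // engine-agnostic: scans the table and bins files into rewrite groups
      List<G> plan();
    }

    interface GroupRewriter<G> {
      // engine-specific: reads the group's files and writes compacted output
      void rewrite(G group);
    }

The point of the split is that only GroupRewriter needs a Spark or Flink implementation; RewritePlanner can live in core and be shared, the same way query planning already is.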
>>>>>>>> The planning currently generates the data for multiple levels:
>>>>>>>>
>>>>>>>> 1. Plan level
>>>>>>>>    - Statistics about the plan:
>>>>>>>>      - Total group count
>>>>>>>>      - Group count in a partition
>>>>>>>>    - Target file size
>>>>>>>>    - Output specification id - only relevant in case of the data
>>>>>>>>    rewrite plan
>>>>>>>> 2. Group level
>>>>>>>>    - General group info
>>>>>>>>      - Global index
>>>>>>>>      - Partition index
>>>>>>>>      - Partition value
>>>>>>>>    - List of tasks to read the data
>>>>>>>>    - Split size - reader input split size when rewriting (Spark
>>>>>>>>    specific)
>>>>>>>>    - Number of expected output files - used to calculate
>>>>>>>>    shuffling partition numbers (Spark specific)
>>>>>>>>
>>>>>>>> I see the following decision points:
>>>>>>>>
>>>>>>>> - Data organization:
>>>>>>>>   1. The plan is the 'result' - everything below it is organized
>>>>>>>>   based on the multiplicity of the data. If some value applies
>>>>>>>>   to every group, then that value belongs to the 'global' plan
>>>>>>>>   variables; if a value is different for every group, then that
>>>>>>>>   value belongs to the group (current code)
>>>>>>>>   2. The group should contain all the information required for a
>>>>>>>>   single job. The job (executor) only receives a single group,
>>>>>>>>   and every other bit of information is global. The drawback is
>>>>>>>>   that some information is duplicated, but it is cleaner on the
>>>>>>>>   executor side.
>>>>>>>> - Parameter handling (the two options are sketched after this
>>>>>>>> message):
>>>>>>>>   1. Use string maps, like we do with FileRewriter.options -
>>>>>>>>   this allows for a more generic API which will be more stable
>>>>>>>>   2. Use typed, named parameters - when the API changes the
>>>>>>>>   users might have breaking code, but they could easily spot the
>>>>>>>>   changes
>>>>>>>> - Engine-specific parameter handling:
>>>>>>>>   1. We generate a common set of parameters
>>>>>>>>   2. Engines get the whole compaction configuration and can have
>>>>>>>>   their own parameter generators
>>>>>>>>
>>>>>>>> Currently I am leaning towards:
>>>>>>>>
>>>>>>>> - Data organization - 2 - the group should contain all the
>>>>>>>> information
>>>>>>>> - Parameter handling - 2 - specific types and named parameters
>>>>>>>> - Engine-specific parameters - 1 - create a common set of
>>>>>>>> parameters
>>>>>>>>
>>>>>>>> Your thoughts?
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> [1] - https://github.com/apache/iceberg/issues/10264
>>>>>>>> [2] - https://github.com/apache/iceberg/pull/11497
>>>>>>>> [3] - https://github.com/apache/iceberg/pull/11513
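To make the "parameter handling" trade-off concrete, a small sketch contrasting the two options; FileRewriter#options is the only identifier taken from the thread, and the other names are hypothetical:

    // Option 1: a string map, as with FileRewriter#options. Generic and
    // stable across releases, but typos and type errors only surface at
    // runtime when the values are parsed.
    import java.util.Map;

    interface StringOptionsPlanner {
      void init(Map<String, String> options);
      // e.g. Long.parseLong(options.getOrDefault("target-file-size-bytes", "536870912"))
    }

    // Option 2: typed, named parameters. An API change breaks compilation,
    // which makes it easy to spot, at the cost of API stability.
    final class TypedPlannerOptions {
      private final long targetFileSizeBytes;
      private final boolean rewriteAll;

      TypedPlannerOptions(long targetFileSizeBytes, boolean rewriteAll) {
        this.targetFileSizeBytes = targetFileSizeBytes;
        this.rewriteAll = rewriteAll;
      }

      long targetFileSizeBytes() { return targetFileSizeBytes; }
      boolean rewriteAll() { return rewriteAll; }
    }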