szehon-ho edited a comment on pull request #2608: URL: https://github.com/apache/iceberg/pull/2608#issuecomment-865317271
As per the discussion that it deserves to be its own action rather than part of RewriteManifests, completely rewrote RepairManifests to be a separate spark action (BaseRepairManifestsSparkAction), and removed the base logic between it and BaseRewriteManifestAction to base class: BaseManifestSparkAction. Summary, it distributes the repair, first grouping all entries by ManifestFile, calculating what needs to be repaired for each entry by reading various aspects of the dataFile pointed to by the entry, and writing all the entries back out if any needed repair (the manifest file still retains same number of entries). Not all logic can be shared. In Repair path, the specId is queried from the original manifest-file , and kept around to write the repaired manifest file (vs passed in). There is also a problem I noticed, the returned ManifestFiles of RewriteManifests action is wrong if "snapshotIdInheritanceEnabled" is false (as this path rewrites the manifest-file to a final location). So fixed the method while extracting it from BaseRewriteManifestsSparkAction to the new base. (A subsequent change can fix this issue and add a test in RewriteManifests). @rdblue @aokolnychyi @flyrain @RussellSpitzer if you guys have time for another look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
