atognolas opened a new pull request, #39158: URL: https://github.com/apache/beam/pull/39158
## Summary

- Modify `AppendFilesToTables.appendDataFiles()` to group data files by partition path and write one manifest per partition using `appendManifest()`
- The commit remains atomic (single `AppendFiles` operation) while producing partition-aligned manifests
- This enables manifest-level partition pruning in query engines (BigQuery, Trino) without a post-write `rewriteManifests` step

## Motivation

When all files share a single partition spec, `appendDataFiles()` currently uses `appendFile()` which places everything into one manifest. With hundreds of partitions, query engines must scan all file entries in the manifest even for single-partition queries.

Measured on a 400-partition table with 99-column schema:

- Single manifest: **14s** BQ slot time
- 400 partition-aligned manifests: **2.86s** BQ slot time (**5× improvement**)

## Notes

Iceberg's `commit.manifest-merge.enabled` (default `true`) will merge these manifests back into fewer manifests. Users who want to preserve partition alignment should set this to `false` or run periodic `rewriteManifests`.

## Test plan

- [ ] Existing `AppendFilesToTablesTest` passes
- [ ] Run IcebergIO integration tests
- [ ] Verify manifest count matches partition count via Iceberg metadata inspection
- [ ] Verify query engines benefit from manifest-level pruning

🤖 Generated with [Claude Code](https://claude.com/claude-code)
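## Illustrative sketch

The grouping step described in the summary can be sketched in plain Java as follows. This is not the code from the PR: `FileEntry` and `PartitionGroupingSketch` are hypothetical stand-ins for Iceberg's `DataFile` and the Beam class, and the sketch stops at the point where the real change would presumably write one manifest per group and pass each to `appendManifest()` inside a single `AppendFiles` commit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionGroupingSketch {

    /** Hypothetical stand-in for an Iceberg DataFile; only the fields this sketch needs. */
    public static final class FileEntry {
        public final String partitionPath;
        public final String location;

        public FileEntry(String partitionPath, String location) {
            this.partitionPath = partitionPath;
            this.location = location;
        }
    }

    /**
     * Groups files by partition path, keeping partitions in first-seen order.
     * Each resulting group corresponds to one partition-aligned manifest.
     */
    public static Map<String, List<FileEntry>> groupByPartition(List<FileEntry> files) {
        Map<String, List<FileEntry>> groups = new LinkedHashMap<>();
        for (FileEntry file : files) {
            groups.computeIfAbsent(file.partitionPath, k -> new ArrayList<>()).add(file);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
            new FileEntry("day=2024-01-01", "a.parquet"),
            new FileEntry("day=2024-01-02", "b.parquet"),
            new FileEntry("day=2024-01-01", "c.parquet"));

        Map<String, List<FileEntry>> groups = groupByPartition(files);

        // Assumption: in the real change, each group would become one manifest,
        // and all manifests would be added to the same AppendFiles operation
        // so the commit stays atomic.
        groups.forEach((partition, group) ->
            System.out.println(partition + " -> " + group.size() + " file(s)"));
    }
}
```

With this shape, the number of groups equals the number of distinct partition paths in the batch, which is the property the "manifest count matches partition count" item in the test plan checks.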
