atognolas opened a new pull request, #39158:
URL: https://github.com/apache/beam/pull/39158

   ## Summary
   - Modify `AppendFilesToTables.appendDataFiles()` to group data files by 
partition path and write one manifest per partition using `appendManifest()`
   - The commit remains atomic (single `AppendFiles` operation) while producing 
partition-aligned manifests
   - This enables manifest-level partition pruning in query engines (BigQuery, 
Trino) without a post-write `rewriteManifests` step
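
The grouping step can be sketched in plain Java. This is a minimal illustration, not the code in this PR: `FileEntry` and `PartitionGrouping` are hypothetical stand-ins (the real change operates on Iceberg `DataFile` objects), and the manifest writing and the single `AppendFiles` commit are only indicated in comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a data file: only the two fields the grouping needs.
final class FileEntry {
  final String partitionPath; // e.g. "day=2024-01-01"
  final String location;      // e.g. "a.parquet"

  FileEntry(String partitionPath, String location) {
    this.partitionPath = partitionPath;
    this.location = location;
  }
}

final class PartitionGrouping {
  // Buckets file locations by partition path, preserving first-seen partition order.
  static Map<String, List<String>> groupByPartition(List<FileEntry> files) {
    Map<String, List<String>> byPartition = new LinkedHashMap<>();
    for (FileEntry f : files) {
      byPartition.computeIfAbsent(f.partitionPath, k -> new ArrayList<>()).add(f.location);
    }
    return byPartition;
  }

  public static void main(String[] args) {
    List<FileEntry> files = Arrays.asList(
        new FileEntry("day=2024-01-01", "a.parquet"),
        new FileEntry("day=2024-01-02", "b.parquet"),
        new FileEntry("day=2024-01-01", "c.parquet"));

    Map<String, List<String>> groups = groupByPartition(files);
    // In the approach described above, each entry would become one manifest,
    // and all manifests would be added to one append operation and committed once.
    for (Map.Entry<String, List<String>> e : groups.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}
```

Using a `LinkedHashMap` keeps the manifest order deterministic for a given input order, which may make metadata inspection in tests easier.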
   
   ## Motivation
   When all files share a single partition spec, `appendDataFiles()` currently 
uses `appendFile()`, which places every file into a single manifest. With 
hundreds of partitions, query engines must scan every file entry in that 
manifest, even for single-partition queries.
   
   Measured on a 400-partition table with a 99-column schema:
   - Single manifest: **14s** BQ slot time
   - 400 partition-aligned manifests: **2.86s** BQ slot time (**~5× 
improvement**)
   
   ## Notes
   Iceberg's `commit.manifest-merge.enabled` (default `true`) can merge these 
manifests back into fewer manifests on commit. Users who want to preserve 
partition alignment should set it to `false` or run `rewriteManifests` 
periodically.
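
For reference, one way to turn off manifest merging on an existing table is through the Iceberg Java API. This is a sketch under the assumption that a `Table` instance has already been loaded from a catalog; the class and method names `DisableManifestMerge.apply` are illustrative and not part of this PR.

```java
import org.apache.iceberg.Table;

final class DisableManifestMerge {
  // Sets commit.manifest-merge.enabled=false on an already-loaded table
  // so that later appends do not merge the partition-aligned manifests.
  static void apply(Table table) {
    table.updateProperties()
        .set("commit.manifest-merge.enabled", "false")
        .commit();
  }
}
```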
   
   ## Test plan
   - [ ] Existing `AppendFilesToTablesTest` passes
   - [ ] Run IcebergIO integration tests
   - [ ] Verify manifest count matches partition count via Iceberg metadata 
inspection
   - [ ] Verify query engines benefit from manifest-level pruning
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

