[I] [Bug]: Improve performance of the Iceberg AddFiles transform [beam]

via GitHub Tue, 31 Mar 2026 08:42:56 -0700


chamikaramj opened a new issue, #38012:
URL: https://github.com/apache/beam/issues/38012


   ### What happened?
   
   Currently, the Iceberg AddFiles transform has performance bottlenecks when 
we try to write a large number of files. For example, we fully read the Parquet 
files being written [1], which can significantly slow down the process. We 
should look into improving single-VM performance without compromising the 
consistency guarantees of the sink.
   
   [1] 
https://github.com/apache/beam/blob/e08e9d56e5ee8ece43cc15967d0edff107651554/sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/AddFiles.java#L685
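
For context on why a full read may be avoidable: a Parquet file stores its metadata in a footer at the end of the file, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. The sketch below is only an illustration of that layout using the JDK alone. `ParquetFooterProbe` and `footerLength` are hypothetical names, not part of Beam, Iceberg, or Parquet, and nothing here claims to be what AddFiles does or should do. It also assumes an unencrypted file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

/** Hypothetical helper for illustration; not a Beam or Iceberg class. */
final class ParquetFooterProbe {
  private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

  /** Returns the footer metadata length in bytes, reading only the last 8 bytes of the file. */
  static int footerLength(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = ch.size();
      // Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
      if (size < 12) {
        throw new IOException("Too small to be a Parquet file: " + file);
      }
      ByteBuffer tail = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
      long start = size - 8;
      // Positional reads may be short, so loop until the 8-byte tail is filled.
      while (tail.hasRemaining()) {
        int n = ch.read(tail, start + tail.position());
        if (n < 0) {
          throw new IOException("Unexpected end of file: " + file);
        }
      }
      tail.flip();
      int length = tail.getInt(); // 4-byte little-endian footer length
      byte[] magic = new byte[4];
      tail.get(magic); // trailing magic bytes
      if (!Arrays.equals(magic, MAGIC)) {
        throw new IOException("Missing PAR1 trailer: " + file);
      }
      if (length < 0 || length > size - 12) {
        throw new IOException("Implausible footer length " + length + ": " + file);
      }
      return length;
    }
  }
}
```

A reader that knows this length can fetch just the footer bytes and parse row counts and column statistics from them, instead of scanning the data pages. Whether that covers everything the linked code needs from each file is an open question for this issue.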
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [x] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Infrastructure
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

