andormarkus commented on issue #1751:
URL: https://github.com/apache/iceberg-python/issues/1751#issuecomment-3625274692
Hi @Fokko,
I see this issue has been marked as stale, but I'd like to follow up as we
now have production-scale code working with this approach.
I'm happy to submit a documentation PR to help the community benefit from
this distributed write pattern. Our implementation has been running
successfully in production, handling high-volume concurrent writes with a
single centralized committer, which avoids the bottleneck of many writers
blocking on the same table commit.
However, I still need your guidance on a few points before proceeding:
1. **`__bytes__` support**: You mentioned supporting `__bytes__` to return
Avro-encoded bytes. Could you elaborate on this? What would the preferred API
design be? Should this be a method on the `DataFile` class itself? (To make
the question concrete, I've sketched our current workaround right after this
list.)
2. **Public API for `_dataframe_to_data_files`**: Achieving distributed
writes currently requires the private function
`pyiceberg.io.pyarrow._dataframe_to_data_files`. Should we consider making it
(or a wrapper around it) part of the public API? It seems essential to the
distributed write pattern (see the end-to-end sketch at the bottom of this
comment).
3. **Documentation scope**: Where would be the most appropriate place to
document this pattern? Should it be:
- A new section in the main docs about distributed/concurrent writes?
- An advanced patterns/recipes page?
- Part of the write operations documentation?
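To make point 1 concrete, here is (simplified) the kind of helper we rely on
today. The pickle-based transport is our workaround, and `DataFile.from_bytes`
mentioned in the comment is purely hypothetical; the idea is that a native
`__bytes__` returning Avro-encoded bytes, plus some inverse, would replace
exactly this:

```python
import pickle

from pyiceberg.manifest import DataFile

# Today's workaround (simplified): pickle the DataFile so workers can ship
# it over a queue. A hypothetical DataFile.__bytes__ returning Avro-encoded
# bytes, plus a hypothetical inverse such as DataFile.from_bytes, would
# replace these helpers and decouple producers and consumers from pickle
# and from pyiceberg version skew.

def data_file_to_bytes(data_file: DataFile) -> bytes:
    return pickle.dumps(data_file)


def data_file_from_bytes(payload: bytes) -> DataFile:
    return pickle.loads(payload)
```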
I'm ready to contribute once I have clarity on the direction you'd like to
take. Our production code covers both partitioned and non-partitioned tables,
proper DataFile serialization/deserialization, and queue-based coordination
(AWS Kinesis in our case, though SQS works just as well).
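For reference, here is a condensed sketch of the pattern using the helpers
above. It is simplified from our production code: the queue transport and the
partitioned/non-partitioned branching are elided, the table identifier is a
placeholder, and it assumes the private internals of recent pyiceberg
releases (`_dataframe_to_data_files` and
`Transaction.update_snapshot().fast_append()`), which is exactly why point 2
matters:

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import _dataframe_to_data_files  # private API today

catalog = load_catalog("default")
table = catalog.load_table("db.events")  # placeholder identifier


# Worker side: write Parquet data files in parallel, but never commit.
def write_data_files(df: pa.Table) -> list[bytes]:
    data_files = _dataframe_to_data_files(
        table_metadata=table.metadata,
        df=df,
        io=table.io,
    )
    # Serialize each DataFile and push it onto the queue
    # (Kinesis for us; SQS works just as well).
    return [data_file_to_bytes(f) for f in data_files]


# Coordinator side: a single committer drains the queue and appends all
# files in one snapshot, so writers never block on the table commit.
def commit_data_files(payloads: list[bytes]) -> None:
    with table.transaction() as tx:
        with tx.update_snapshot().fast_append() as append:
            for payload in payloads:
                append.append_data_file(data_file_from_bytes(payload))
```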
Looking forward to your guidance!