NoahKusaba opened a new pull request, #2613: URL: https://github.com/apache/iceberg-rust/pull/2613
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes https://github.com/apache/datafusion-ballista/issues/1241 ## What changes are included in this PR? <!-- Provide a summary of the modifications in this PR. List the main changes such as new features, bug fixes, refactoring, or any other updates. --> Adds a new iceberg-ballista crate, which provides a distributed-query driver for Apache Iceberg for a distributed datafusion engine Apache Datafusion-Ballista + the targeted changes to iceberg-datafusion that make Iceberg's existing plan nodes serializable so they can cross node boundaries. The core problem it solves Iceberg's DataFusion integration already produces complete physical read and write plans, but every Iceberg plan node holds live, non-serializable state (Arc<dyn Catalog>, an open Table/FileIO). Ballista ships logical and physical plans to remote schedulers/executors, so those nodes couldn't travel. This branch closes that gap with one consistent idea: serialize a minimal self-contained recipe (IcebergCatalogConfig + identifiers), rebuild the live objects on the receiving node. - IcebergLogicalCodec: serializes the catalog-backed table provider (config + table ident, plus snapshot/metadata variants) so the scheduler can rebuild it and do physical planning, including INSERT. - IcebergPhysicalCodec: serializes the four Iceberg execution nodes (IcebergTableScan, IcebergWriteExec, IcebergCommitExec, IcebergMetadataScan) and the PartitionExpr physical expression. - Tagged-envelope wire framing (TAG_ICEBERG / TAG_DELEGATED) — every blob carries a leading tag; non-Iceberg nodes are delegated to Ballista's own codec rather than byte-sniffed, so shuffles/sorts/etc. keep working and an unknown tag is a hard error, not a silent misparse. --> Based off comments from https://github.com/milenkovicm/ballista_delta - serde.rs runtime bridge: block_on that adapts to whatever runtime the sync codec entry point is on (multi-thread → block_in_place; current-thread → dedicated thread; none → temp runtime), plus a process-wide catalog cache (durable-runtime-only) and a loader-based build_catalog that supports every catalog type (rest, sql, glue, hms, s3tables) and storage backend via OpenDalResolvingStorageFactory. Public API: register_iceberg_codecs(SessionConfig) and register_iceberg_table(...); a runnable examples/standalone-iceberg-write.rs. ## Are these changes tested? Ballista tests: - Distributed reads / writes tested against dockerized minio iceberg catalog ( Standalone + multi-executor cluster). Also tests partitioned files writes + iceberg table registration - roundtrips testing that configurations for nodes are maintained through serialization -> deserialization Datafusion tests: - Unit tests for config propagation and snapshot pinning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
