alamb commented on code in PR #9260:
URL: https://github.com/apache/arrow-datafusion/pull/9260#discussion_r1495321696

##########
datafusion/substrait/src/lib.rs:
##########
@@ -15,6 +15,61 @@
 // specific language governing permissions and limitations
 // under the License.
+//! Serialize / Deserialize DataFusion Plans to [Substrait.io]
+//!
+//! This crate provides support for serializing and deserializing DataFusion plans
+//! to and from the generated types in [substrait::proto] from the [substrait] crate.
+//!
+//! [Substrait.io] provides a cross-language serialization format for relational
+//! algebra (e.g. query plans and expressions), based on protocol buffers.
+//!
+//! [Substrait.io]: https://substrait.io/
+//!
+//! [`LogicalPlan`]: datafusion::logical_expr::LogicalPlan
+//! [`ExecutionPlan`]: datafusion::physical_plan::ExecutionPlan
+//!
+//! Potential uses of this crate:
+//! * Use DataFusion run Substrait plans created by other systems (e.g. Apache Calcite)
+//! * Use DataFusion to create plans to run on other systems
+//! * Pass query plans over FFI boundaries, such as from Python to Rust

Review Comment:
   I added a note in 708c7cd11

   I am curious why you chose to use substrait and not datafusion-proto *within* the same system (we use datafusion-proto for this case, for its wider compatibility)


##########
datafusion/proto/README.md:
##########
@@ -17,121 +17,13 @@
   under the License.
 -->

-# Apache Arrow DataFusion Proto
+# `datafusion-proto`: Apache Arrow DataFusion Proto Serialization / Deserialization

-Apache Arrow [DataFusion][df] is an extensible query execution framework,
-written in Rust, that uses Apache Arrow as its in-memory format.
+This crate contains code to convert Apache Arrow [DataFusion] plans to and from
+bytes, which can be useful for sending plans over the network, for example
+when building a distributed query engine.

-This crate provides support format for serializing and deserializing the
-following structures to and from bytes:
+See [API Docs] for details and examples.

-1. [`LogicalPlan`]'s (including [`Expr`]),
-2. [`ExecutionPlan`]s (including [`PhysiscalExpr`])
-
-This format can be useful for sending plans over the network, for example when
-building a distributed query engine.
-
-Internally, this crate is implemented by converting the plans to [protocol
-buffers] using [prost].
-
-[protocol buffers]: https://developers.google.com/protocol-buffers
-[`logicalplan`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
-[`expr`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/expr/enum.Expr.html
-[`executionplan`]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
-[`physiscalexpr`]: https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html
-[prost]: https://docs.rs/prost/latest/prost/
-
-## See Also
-
-The binary format created by this crate supports the full range of DataFusion
-plans, but is DataFusion specific. See [datafusion-substrait] which can encode
-many DataFusion plans using the [substrait.io] standard.
-
-[datafusion-substrait]: https://docs.rs/datafusion-substrait/latest/datafusion_substrait
-[substrait.io]: https://substrait.io
-
-# Examples
-
-## Serializing Expressions
-
-Based on [examples/expr_serde.rs](examples/expr_serde.rs)
-
-```rust
-use datafusion_common::Result;
-use datafusion_expr::{col, lit, Expr};
-use datafusion_proto::bytes::Serializeable;
-
-fn main() -> Result<()> {
-    // Create a new `Expr` a < 32
-    let expr = col("a").lt(lit(5i32));
-
-    // Convert it to an opaque form
-    let bytes = expr.to_bytes()?;
-
-    // Decode bytes from somewhere (over network, etc.)
-    let decoded_expr = Expr::from_bytes(&bytes)?;
-    assert_eq!(expr, decoded_expr);
-    Ok(())
-}
-```
-
-## Serializing Logical Plans
-
-Based on [examples/logical_plan_serde.rs](examples/logical_plan_serde.rs)
-
-```rust
-use datafusion::prelude::*;
-use datafusion_common::Result;
-use datafusion_proto::bytes::{logical_plan_from_bytes, logical_plan_to_bytes};
-
-#[tokio::main]
-async fn main() -> Result<()> {
-    let ctx = SessionContext::new();
-    ctx.register_csv("t1", "tests/testdata/test.csv", CsvReadOptions::default())
-        .await
-        ?;
-    let plan = ctx.table("t1").await?.into_optimized_plan()?;
-    let bytes = logical_plan_to_bytes(&plan)?;
-    let logical_round_trip = logical_plan_from_bytes(&bytes, &ctx)?;
-    assert_eq!(format!("{:?}", plan), format!("{:?}", logical_round_trip));
-    Ok(())
-}
-```
-
-## Serializing Physical Plans
-
-Based on [examples/physical_plan_serde.rs](examples/physical_plan_serde.rs)
-
-```rust
-use datafusion::prelude::*;
-use datafusion_common::Result;
-use datafusion_proto::bytes::{physical_plan_from_bytes,physical_plan_to_bytes};
-
-#[tokio::main]
-async fn main() -> Result<()> {
-    let ctx = SessionContext::new();
-    ctx.register_csv("t1", "tests/testdata/test.csv", CsvReadOptions::default())
-        .await
-        ?;
-    let logical_plan = ctx.table("t1").await?.into_optimized_plan()?;
-    let physical_plan = ctx.create_physical_plan(&logical_plan).await?;
-    let bytes = physical_plan_to_bytes(physical_plan.clone())?;
-    let physical_round_trip = physical_plan_from_bytes(&bytes, &ctx)?;
-    assert_eq!(format!("{:?}", physical_plan), format!("{:?}", physical_round_trip));
-    Ok(())
-}
-
-```
-
-## Generated Code
-
-The prost/tonic code can be generated by running, which in turn invokes the Rust binary located in [gen](./gen)
-
-This is necessary after modifying the protobuf definitions or altering the dependencies of [gen](./gen), and requires a
-valid installation of [protoc](https://github.com/protocolbuffers/protobuf#protocol-compiler-installation).
-
-```bash
-./regen.sh
-```
-
-[df]: https://crates.io/crates/datafusion
+[datafusion]: https://arrow.apache.org/datafusion
+[api docs]: http://docs.rs/datafusion-substrait/latest/datafusion-substrait

Review Comment:
   Yes, excellent catch. Fixed in f596397fe


##########
datafusion/proto/REAME-dev.md:
##########
@@ -0,0 +1,29 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+## Generated Code
+
+The prost/tonic code can be generated by running, which in turn invokes the Rust binary located in [gen](./gen)
+
+This is necessary after modifying the protobuf definitions or altering the dependencies of [gen](./gen), and requires a
+valid installation of [protoc](https://github.com/protocolbuffers/protobuf#protocol-compiler-installation).

Review Comment:
   I think a link to the rendered docs makes sense -- updated in aa09db69f. Thanks for the suggestion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
