andygrove commented on code in PR #1113: URL: https://github.com/apache/datafusion-ballista/pull/1113#discussion_r1845009845
##########
README.md:
##########
@@ -17,53 +17,72 @@
   under the License.
 -->
 
-# Ballista: Distributed SQL Query Engine, built on Apache Arrow
+# Ballista: Making DataFusion Applications Distributed
 
-Ballista is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow][arrow] and
-[Apache Arrow DataFusion][datafusion].
+Ballista is a library which makes [Apache DataFusion](https://github.com/apache/datafusion) applications distributed.
 
-If you are looking for documentation for a released version of Ballista, please refer to the
-[Ballista User Guide][user-guide].
+Existing DataFusion application:
 
-## Overview
+```rust
+use datafusion::prelude::*;
 
-Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    let ctx = SessionContext::new();
 
-- The choice of Rust as the main execution language avoids the overhead of GC pauses and results in deterministic
-  processing times.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged efficiently between
-  executors using the [Flight Protocol][flight], and between clients and schedulers/executors using the
-  [Flight SQL Protocol][flight-sql]
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+can be distributed with few lines of code changed:
+
+> [!IMPORTANT]
+> There is a gap between DataFusion and Ballista, which may bring incompatibilities. The community is working hard to close this gap
+
+```rust
+use ballista::prelude::*;
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+    // create DataFusion SessionContext with ballista standalone cluster started
+    let ctx = SessionContext::standalone();
+
+    // register the table
+    ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+    // create a plan to run a SQL query
+    let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+    // execute and print results
+    df.show().await?;
+    Ok(())
+}
+```
+
+If you are looking for documentation or more examples, please refer to the [Ballista User Guide][user-guide].
 
 ## Architecture
 
 A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes
 can be run as native binaries and are also available as Docker Images, which can be easily deployed with
-[Docker Compose](https://datafusion.apache.org/ballista/user-guide/deployment/docker-compose.html) or
-[Kubernetes](https://datafusion.apache.org/ballista/user-guide/deployment/kubernetes.html).

Review Comment:
Why do we want to remove the k8s docs?
