Re: [PR] Update root `README.md` and other documentation with latest changes [datafusion-ballista]

via GitHub Sat, 16 Nov 2024 07:51:52 -0800


andygrove commented on code in PR #1113:
URL: https://github.com/apache/datafusion-ballista/pull/1113#discussion_r1845009845



##########
README.md:
##########
@@ -17,53 +17,72 @@
   under the License.
 -->
 
-# Ballista: Distributed SQL Query Engine, built on Apache Arrow
+# Ballista: Making DataFusion Applications Distributed
 
-Ballista is a distributed SQL query engine powered by the Rust implementation of [Apache Arrow][arrow] and
-[Apache Arrow DataFusion][datafusion].
+Ballista is a library which makes [Apache DataFusion](https://github.com/apache/datafusion) applications distributed.
 
-If you are looking for documentation for a released version of Ballista, please refer to the
-[Ballista User Guide][user-guide].
+Existing DataFusion application:
 
-## Overview
+```rust
+use datafusion::prelude::*;
 
-Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  let ctx = SessionContext::new();
 
-- The choice of Rust as the main execution language avoids the overhead of GC pauses and results in deterministic
-  processing times.
-- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
-  processing (SIMD) and efficient compression. Although Spark does have some columnar support, it is still
-  largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged efficiently between
-  executors using the [Flight Protocol][flight], and between clients and schedulers/executors using the
-  [Flight SQL Protocol][flight-sql]
+  // register the table
+  ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+can be distributed with a few lines of code changed:
+
+> [!IMPORTANT]  
+> There is a gap between DataFusion and Ballista, which may bring incompatibilities. The community is working hard to close this gap.
+
+```rust
+use ballista::prelude::*;
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // create DataFusion SessionContext with ballista standalone cluster started
+  let ctx = SessionContext::standalone().await?;
+
+  // register the table
+  ctx.register_csv("example", "tests/data/example.csv", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT a, MIN(b) FROM example WHERE a <= b GROUP BY a LIMIT 100").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+If you are looking for documentation or more examples, please refer to the [Ballista User Guide][user-guide].
 
 ## Architecture
 
 A Ballista cluster consists of one or more scheduler processes and one or more executor processes. These processes
 can be run as native binaries and are also available as Docker Images, which can be easily deployed with
-[Docker Compose](https://datafusion.apache.org/ballista/user-guide/deployment/docker-compose.html) or
-[Kubernetes](https://datafusion.apache.org/ballista/user-guide/deployment/kubernetes.html).

Review Comment:
   Why do we want to remove the k8s docs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

