[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #5962: [DOCS]: consolidate doc site content simplify navbar

via GitHub Wed, 12 Apr 2023 00:57:01 -0700


waynexia commented on code in PR #5962:
URL: https://github.com/apache/arrow-datafusion/pull/5962#discussion_r1163748404



##########
docs/source/contributor-guide/architecture.md:
##########
@@ -20,7 +20,8 @@
 # Architecture
 
 DataFusion's code structure and organization is described in the
-[Crate Documentation], to keep it as close to the source as
-possible.
+[crates.io documentation], to keep it as close to the source as
+possible. You can find the most up to date version in the [source code].

Review Comment:
   What do you think about hosting the latest documentation generated from the 
source code on GitHub Pages (or another static site host)? For example, 
[greptimedb.rs](https://greptimedb.rs) is generated from 
https://github.com/GreptimeTeam/greptimedb/deployments/activity_log?environment=github-pages



##########
docs/source/user-guide/example-usage.md:
##########
@@ -141,3 +141,112 @@ async fn main() -> datafusion::error::Result<()> {
 | 1 | 2      |
 +---+--------+
 ```
+
+# Using DataFusion as a library
+
+## Create a new project
+
+```shell
+cargo new hello_datafusion
+```
+
+```shell
+$ cd hello_datafusion
+$ tree .
+.
+├── Cargo.toml
+└── src
+    └── main.rs
+
+1 directory, 2 files
+```
+
+## Default Configuration
+
+DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+
+To get started, add the following to your `Cargo.toml` file:
+
+```toml
+[dependencies]
+datafusion = "11.0"
+```
+
+## Create a main function
+
+Update the main.rs file with your first datafusion application based on [Example usage](https://arrow.apache.org/datafusion/user-guide/example-usage.html)
+
+```rust
+use datafusion::prelude::*;
+
+#[tokio::main]
+async fn main() -> datafusion::error::Result<()> {
+  // register the table
+  let ctx = SessionContext::new();
+  ctx.register_csv("test", "<PATH_TO_YOUR_CSV_FILE>", CsvReadOptions::new()).await?;
+
+  // create a plan to run a SQL query
+  let df = ctx.sql("SELECT * FROM test").await?;
+
+  // execute and print results
+  df.show().await?;
+  Ok(())
+}
+```
+
+## Extensibility
+
+DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
+
+- [x] User Defined Functions (UDFs)
+- [x] User Defined Aggregate Functions (UDAFs)
+- [x] User Defined Table Source (`TableProvider`) for tables
+- [x] User Defined `Optimizer` passes (plan rewrites)
+- [x] User Defined `LogicalPlan` nodes
+- [x] User Defined `ExecutionPlan` nodes
+
+## Rust Version Compatibility
+
+This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler.
+
+## Optimized Configuration
+
+For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is
+worth noting that using the settings in the `[profile.release]` section will significantly increase the build time.
+
+```toml
+[dependencies]
+datafusion = { version = "11.0" , features = ["simd"]}

Review Comment:
   This is also outdated. I wonder if there is some way to render code or files 
directly from GitHub, so that we needn't update this file every time and could 
instead render our example code. I found 
[this](https://github.blog/2017-08-15-introducing-embedded-code-snippets/), but 
it looks like it only works inside GitHub.
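
One possible direction, sketched very roughly: keep each snippet as a template and fill in the version when the docs are built, so the number is never typed by hand. The `render_snippet` helper and the `{version}` placeholder below are hypothetical and are not part of the current DataFusion docs build.

```rust
/// Hypothetical helper: fills a `{version}` placeholder in a docs snippet.
/// Templates without the placeholder are returned unchanged.
fn render_snippet(template: &str, version: &str) -> String {
    template.replace("{version}", version)
}

fn main() {
    // Template for the Cargo.toml fragment shown in the user guide.
    let template = "[dependencies]\ndatafusion = \"{version}\"";

    // Assumption: the version is hard-coded here only to keep the sketch
    // self-contained; a real build would take it from the crate metadata.
    let rendered = render_snippet(template, "11.0");
    println!("{rendered}");
}
```

In an actual build the version string could come from `env!("CARGO_PKG_VERSION")` rather than a literal, which would keep the rendered snippet in step with each release.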



##########
docs/source/user-guide/faq.md:
##########
@@ -29,3 +29,37 @@ model and computational kernels. It is designed to run within a single process,
 for parallel query execution.
 
 [Ballista](https://github.com/apache/arrow-ballista) is a distributed compute platform built on DataFusion.
+
+# How does DataFusion Compare with `XYZ`?
+
+When compared to similar systems, DataFusion typically is:
+
+1. Targeted at developers, rather than end users / data scientists.
+2. Designed to be embedded, rather than a complete file based SQL system.
+3. Governed by the [Apache Software Foundation](https://www.apache.org/) process, rather than a single company or individual.
+4. Implemented in `Rust`, rather than `C/C++`
+
+Here is a comparison with similar projects that may help understand
+when DataFusion might be suitable and unsuitable for your needs:
+
+- [DuckDB](http://www.duckdb.org) is an open source, in process analytic database.
+  Like DataFusion, it supports very fast execution, both from its custom file format
+  and directly from parquet files. Unlike DataFusion, it is written in C/C++ and it
+  is primarily used directly by users as a serverless database and query system rather
+  than as a library for building such database systems.
+
+- [Polars](http://pola.rs): Polars is one of the fastest DataFrame
+  libraries at the time of writing. Like DataFusion, it is also
+  written in Rust and uses the Apache Arrow memory model, but unlike
+  DataFusion it does not provide SQL nor as many extension points.
+
+- [Facebook Velox](https://engineering.fb.com/2022/08/31/open-source/velox/)
+  is an execution engine. Like DataFusion, Velox aims to
+  provide a reusable foundation for building database-like systems. Unlike DataFusion,
+  it is written in C/C++ and does not include a SQL frontend or planning /optimization

Review Comment:
   ```suggestion
     it is written in C/C++ and does not include a SQL frontend or planning / optimization
   ```



