efredine commented on code in PR #11290:
URL: https://github.com/apache/datafusion/pull/11290#discussion_r1668904890
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,129 +19,267 @@
# Using the DataFrame API
-## What is a DataFrame
+The [Users Guide] introduces the [`DataFrame`] API and this section describes
+that API in more depth.
-`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface,
-and is a thin wrapper over LogicalPlan that adds functionality for building and
-executing those plans.
+## What is a DataFrame?
-```rust
-pub struct DataFrame {
- session_state: SessionState,
- plan: LogicalPlan,
-}
-```
-
-You can build up `DataFrame`s using its methods, similarly to building
-`LogicalPlan`s using `LogicalPlanBuilder`:
-
-```rust
-let df = ctx.table("users").await?;
+As described in the [Users Guide], DataFusion [`DataFrame`]s are modeled after
+the [Pandas DataFrame] interface, and are implemented as a thin wrapper over a
+[`LogicalPlan`] that adds functionality for building and executing those plans.
-// Create a new DataFrame sorted by `id`, `bank_account`
-let new_df = df.select(vec![col("id"), col("bank_account")])?
- .sort(vec![col("id")])?;
-
-// Build the same plan using the LogicalPlanBuilder
-let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
- .project(vec![col("id"), col("bank_account")])?
- .sort(vec![col("id")])?
- .build()?;
-```
-
-You can use `collect` or `execute_stream` to execute the query.
+The simplest possible dataframe is one that scans a table; that table can be
+in a file or in memory.
## How to generate a DataFrame
-You can directly use the `DataFrame` API or generate a `DataFrame` from a SQL
-query.
-
-For example, to use `sql` to construct `DataFrame`:
+You can construct [`DataFrame`]s programmatically using the API, similarly to
+other DataFrame APIs. For example, you can read an in-memory `RecordBatch` into
+a `DataFrame`:
```rust
-let ctx = SessionContext::new();
-// Register the in-memory table containing the data
-ctx.register_table("users", Arc::new(create_memtable()?))?;
-let dataframe = ctx.sql("SELECT * FROM users;").await?;
+use std::sync::Arc;
+use datafusion::prelude::*;
+use datafusion::arrow::array::{ArrayRef, Int32Array};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::error::Result;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+ let ctx = SessionContext::new();
+ // Register an in-memory table containing the following data
+ // id | bank_account
+ // ---|-------------
+ // 1 | 9000
+ // 2 | 8000
+ // 3 | 7000
Review Comment:
The in-memory examples are concise and it's easy to get the gist of what's
going on. But it also throws people into the deep end of the Arrow format,
which lacks a gentle introduction IMO. The Arrow-rs documentation gets
immediately into the weeds!
It's likely that many users might never even need to know or access the
Arrow format directly. They will just read and write to CSV or Parquet.
I don't think this needs to change, but perhaps what's missing is a section
on how and when to use the Arrow format? A gentler introduction to Record
Batches.
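For contrast, here's a rough sketch of the file-based path I mean — users who
stay at this level never have to touch a `RecordBatch`. (`users.csv` is a
hypothetical file, not something from the PR; this uses DataFusion's
`SessionContext::read_csv` and `CsvReadOptions`.)

```rust
use datafusion::prelude::*;
use datafusion::error::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Read a file straight into a DataFrame — no Arrow arrays,
    // no RecordBatch construction, no schema boilerplate.
    // "users.csv" is an illustrative path only.
    let df = ctx.read_csv("users.csv", CsvReadOptions::new()).await?;
    df.show().await?;
    Ok(())
}
```

A section like that first, with the `RecordBatch` construction presented
afterwards as the "advanced, in-memory" option, might be the gentler on-ramp.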
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]