efredine commented on code in PR #11290:
URL: https://github.com/apache/datafusion/pull/11290#discussion_r1668904890
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,129 +19,267 @@
# Using the DataFrame API
-## What is a DataFrame
+The [Users Guide] introduces the [`DataFrame`] API and this section describes
+that API in more depth.
-`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface,
-and is a thin wrapper over LogicalPlan that adds functionality for building and
-executing those plans.
+## What is a DataFrame?
-```rust
-pub struct DataFrame {
- session_state: SessionState,
- plan: LogicalPlan,
-}
-```
-
-You can build up `DataFrame`s using its methods, similarly to building
-`LogicalPlan`s using `LogicalPlanBuilder`:
-
-```rust
-let df = ctx.table("users").await?;
+As described in the [Users Guide], DataFusion [`DataFrame`]s are modeled after
+the [Pandas DataFrame] interface, and are implemented as a thin wrapper over a
+[`LogicalPlan`] that adds functionality for building and executing those plans.
-// Create a new DataFrame sorted by `id`, `bank_account`
-let new_df = df.select(vec![col("id"), col("bank_account")])?
- .sort(vec![col("id")])?;
-
-// Build the same plan using the LogicalPlanBuilder
-let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
- .project(vec![col("id"), col("bank_account")])?
- .sort(vec![col("id")])?
- .build()?;
-```
-
-You can use `collect` or `execute_stream` to execute the query.
+The simplest possible dataframe is one that scans a table; that table can be
+in a file or in memory.
## How to generate a DataFrame
-You can directly use the `DataFrame` API or generate a `DataFrame` from a SQL
-query.
-
-For example, to use `sql` to construct `DataFrame`:
+You can construct [`DataFrame`]s programmatically using the API, similarly to
+other DataFrame APIs. For example, you can read an in-memory `RecordBatch` into
+a `DataFrame`:
```rust
-let ctx = SessionContext::new();
-// Register the in-memory table containing the data
-ctx.register_table("users", Arc::new(create_memtable()?))?;
-let dataframe = ctx.sql("SELECT * FROM users;").await?;
+use std::sync::Arc;
+use datafusion::prelude::*;
+use datafusion::arrow::array::{ArrayRef, Int32Array};
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::error::Result;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+ let ctx = SessionContext::new();
+ // Register an in-memory table containing the following data
+ // id | bank_account
+ // ---|-------------
+ // 1 | 9000
+ // 2 | 8000
+ // 3 | 7000
Review Comment:
The in-memory examples are concise and it's easy to get the gist of what's
going on. But it also throws people into the deep end of the Arrow format,
which lacks a gentle introduction IMO. The Arrow-rs documentation gets
immediately into the weeds!
It's likely that many users might never even need to know or access the
Arrow format directly. They will just read and write to CSV or Parquet.
I don't think this needs to change, but perhaps what's missing is a section
on how and when to use the Arrow format? A gentler introduction to Record
Batches.
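For contrast, here's a rough sketch of the file-based path I mean — users who
stay at this level never have to touch a `RecordBatch`. (`users.csv` is a
hypothetical file, not something from the PR; this uses DataFusion's
`SessionContext::read_csv` and `CsvReadOptions`.)

```rust
use datafusion::prelude::*;
use datafusion::error::Result;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Read a file straight into a DataFrame — no Arrow arrays,
    // no RecordBatch construction, no schema boilerplate.
    // "users.csv" is an illustrative path only.
    let df = ctx.read_csv("users.csv", CsvReadOptions::new()).await?;
    df.show().await?;
    Ok(())
}
```

A section like that first, with the `RecordBatch` construction presented
afterwards as the "advanced, in-memory" option, might be the gentler on-ramp.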
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]