alamb commented on code in PR #8319:
URL: https://github.com/apache/arrow-datafusion/pull/8319#discussion_r1407614920
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
Review Comment:
```suggestion
`DataFrame` in `datafusion` is modeled after the Pandas DataFrame interface,
and is a thin wrapper over LogicalPlan that adds functionality for building and
executing those plans.
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx
+ .table("users")
+ .filter(col("a").lt_eq(col("b")))?
+ .sort(vec![col("a").sort(true, true), col("b").sort(false, false)])?;
+```
+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the
internal `LogicalPlan`. You can do this by using `collect` to retrieve all
outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
+
+You can just collect all outputs once like:
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let batches = df.collect().await?;
+```
+
+You can also use stream output to iterate the `RecordBatch`
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let mut stream = df.execute_stream().await?;
+while let Some(rb) = stream.next().await {
+ println!("{rb:?}");
+}
+```
+
+# Write DataFrame to Files
+
+You can also serialize `DataFrame` to a file. For now, `Datafusion` supports
write `DataFrame` to `csv`, `json` and `parquet`.
+
+Before writing to a file, it will call collect to calculate all the results of
the DataFrame and then write to file.
Review Comment:
I don't think the DataFrame API calls `collect` here; I think it uses the
streaming APIs instead.
```suggestion
When writing a file, DataFusion will execute the DataFrame and stream the
results to a file.
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
Review Comment:
```suggestion
You can use `collect` or `execute_stream` to execute the query.
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
Review Comment:
```suggestion
// Build the same plan using the LogicalPlanBuilder
let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
Review Comment:
```suggestion
For example, to use `sql` to construct `DataFrame`:
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx
+ .table("users")
+ .filter(col("a").lt_eq(col("b")))?
+ .sort(vec![col("a").sort(true, true), col("b").sort(false, false)])?;
+```
+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the
internal `LogicalPlan`. You can do this by using `collect` to retrieve all
outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
+
+You can just collect all outputs once like:
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let batches = df.collect().await?;
+```
+
+You can also use stream output to iterate the `RecordBatch`
Review Comment:
```suggestion
You can also use stream output to incrementally generate output one
`RecordBatch` at a time
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx
+ .table("users")
+ .filter(col("a").lt_eq(col("b")))?
+ .sort(vec![col("a").sort(true, true), col("b").sort(false, false)])?;
+```
+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the
internal `LogicalPlan`. You can do this by using `collect` to retrieve all
outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
+
+You can just collect all outputs once like:
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let batches = df.collect().await?;
+```
+
+You can also use stream output to iterate the `RecordBatch`
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let mut stream = df.execute_stream().await?;
+while let Some(rb) = stream.next().await {
+ println!("{rb:?}");
+}
+```
+
+# Write DataFrame to Files
+
+You can also serialize `DataFrame` to a file. For now, `Datafusion` supports
write `DataFrame` to `csv`, `json` and `parquet`.
+
+Before writing to a file, it will call collect to calculate all the results of
the DataFrame and then write to file.
+
+For example, if you write it to a csv_file
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(mem_table))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+
+dataframe
+ .write_csv("user_dataframe.csv", DataFrameWriteOptions::default(), None)
+ .await;
+```
+
+and the file will look like (Example Output):
+
+```
+id,bank_account
+1,9000
+```
+
+## Transform between LogicalPlan and DataFrame
+
+As it is showed above, `DataFrame` is just a very thin wrapper of
`LogicalPlan`, so you can easily go back and forth between them.
Review Comment:
```suggestion
As shown above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so
you can easily go back and forth between them.
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
Review Comment:
```suggestion
You can build up `DataFrame`s using its methods, similarly to building
`LogicalPlan`s using `LogicalPlanBuilder`:
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
Review Comment:
```suggestion
You can directly use the `DataFrame` API or generate a `DataFrame` from a
SQL query.
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
Review Comment:
```suggestion
// Create a new DataFrame with columns `id` and `bank_account`, sorted by `id`
let new_df = df.select(vec![col("id"), col("bank_account")])?
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx
+ .table("users")
+ .filter(col("a").lt_eq(col("b")))?
+ .sort(vec![col("a").sort(true, true), col("b").sort(false, false)])?;
+```
+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the
internal `LogicalPlan`. You can do this by using `collect` to retrieve all
outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
Review Comment:
```suggestion
DataFusion `DataFrame`s are "lazy", meaning they do not do any processing
until they are executed, which allows for additional optimizations.

When you have a `DataFrame`, you can run it in one of three ways:

1. `collect`, which executes the query and buffers all the output into a
   `Vec<RecordBatch>`
2. `execute_stream`, which begins execution and returns a
   `SendableRecordBatchStream` that incrementally computes output on each call
   to `next()`
3. `cache`, which executes the query and buffers the output into a new
   in-memory `DataFrame`
```
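To make the distinction between the three concrete, here is a toy sketch that uses only the Rust standard library. It is not DataFusion's API: `LazyFrame`, `Batch`, and every method body below are hypothetical stand-ins, and a `Vec<i64>` plays the role of a `RecordBatch`. It only illustrates that building the plan does no work, that streaming produces one batch per `next()`, and that collecting or caching buffers everything.

```rust
// Toy model only: these types are hypothetical and are NOT DataFusion's API.

/// A "batch" of rows; stands in for a RecordBatch.
type Batch = Vec<i64>;

/// A lazy frame: input batches plus a list of pending row filters.
/// Nothing is evaluated until one of the execution methods is called.
struct LazyFrame {
    input: Vec<Batch>,
    filters: Vec<Box<dyn Fn(&i64) -> bool>>,
}

impl LazyFrame {
    fn new(input: Vec<Batch>) -> Self {
        LazyFrame { input, filters: Vec::new() }
    }

    /// Record a filter in the plan without touching any data.
    fn filter(mut self, pred: impl Fn(&i64) -> bool + 'static) -> Self {
        self.filters.push(Box::new(pred));
        self
    }

    /// Run the plan incrementally: each `next()` yields one filtered batch.
    fn execute_stream(self) -> impl Iterator<Item = Batch> {
        let filters = self.filters;
        self.input.into_iter().map(move |batch| {
            batch
                .into_iter()
                .filter(|row| filters.iter().all(|f| f(row)))
                .collect()
        })
    }

    /// Run the plan and buffer every output batch in memory.
    fn collect(self) -> Vec<Batch> {
        self.execute_stream().collect()
    }

    /// Run the plan once and wrap the buffered output in a new frame.
    fn cache(self) -> LazyFrame {
        LazyFrame::new(self.collect())
    }
}
```

In this sketch, calling `filter` only appends to `filters`; the predicate first runs when the iterator returned by `execute_stream` is advanced, or when `collect` or `cache` drains it.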
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
Review Comment:
```suggestion
To construct `DataFrame` using the API:
```
##########
docs/source/library-user-guide/using-the-dataframe-api.md:
##########
@@ -19,4 +19,123 @@
# Using the DataFrame API
-Coming Soon
+## What is a DataFrame
+
+`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over
LogicalPlan.
+
+```rust
+pub struct DataFrame {
+ session_state: SessionState,
+ plan: LogicalPlan,
+}
+```
+
+For both `DataFrame` and `LogicalPlan`, you can build the query manually, such
as:
+
+```rust
+let df = ctx.table("users").await?;
+
+let new_df = df.select(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?;
+
+let plan = LogicalPlanBuilder::from(&df.to_logical_plan())
+ .project(vec![col("id"), col("bank_account")])?
+ .sort(vec![col("id")])?
+ .build()?;
+```
+
+But The main difference between a DataFrame and a LogicalPlan is that the
DataFrame contains functionality for executing queries rather than just
building plans.
+
+You can use `collect` or `execute_stream` to execute the query.
+
+## How to generate a DataFrame
+
+You can manually call the `DataFrame` API or automatically generate a
`DataFrame` through the SQL query planner just like:
+
+use `sql` to construct `DataFrame`:
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx.sql("SELECT * FROM users;").await?;
+```
+
+construct `DataFrame` manually
+
+```rust
+let ctx = SessionContext::new();
+// Register the in-memory table containing the data
+ctx.register_table("users", Arc::new(create_memtable()?))?;
+let dataframe = ctx
+ .table("users")
+ .filter(col("a").lt_eq(col("b")))?
+ .sort(vec![col("a").sort(true, true), col("b").sort(false, false)])?;
+```
+
+## Collect / Streaming Exec
+
+When you have a `DataFrame`, you may want to access the results of the
internal `LogicalPlan`. You can do this by using `collect` to retrieve all
outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
+
+You can just collect all outputs once like:
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let batches = df.collect().await?;
+```
+
+You can also use stream output to iterate the `RecordBatch`
+
+```rust
+let ctx = SessionContext::new();
+let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+let mut stream = df.execute_stream().await?;
+while let Some(rb) = stream.next().await {
+ println!("{rb:?}");
+}
+```
+
+# Write DataFrame to Files
+
+You can also serialize `DataFrame` to a file. For now, `Datafusion` supports
write `DataFrame` to `csv`, `json` and `parquet`.
+
+Before writing to a file, it will call collect to calculate all the results of
the DataFrame and then write to file.
+
+For example, if you write it to a csv_file
Review Comment:
```suggestion
For example, to write a CSV file:
```