This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new db6701632fb [DOCS] Update python/rust page and links (#12788)
db6701632fb is described below
commit db6701632fbac4ce428d25de4f2b5c01f94a4b45
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Feb 5 19:12:05 2025 -0600
[DOCS] Update python/rust page and links (#12788)
Update python/rust quick start page to be in sync with hudi-rs readme. And
fix daft hudi page links.
---
website/docs/python-rust-quick-start-guide.md | 210 +++++++++++++++++----
website/src/pages/ecosystem.md | 2 +-
.../python-rust-quick-start-guide.md | 210 +++++++++++++++++----
.../version-1.0.0/python-rust-quick-start-guide.md | 210 +++++++++++++++++----
4 files changed, 508 insertions(+), 124 deletions(-)
diff --git a/website/docs/python-rust-quick-start-guide.md
b/website/docs/python-rust-quick-start-guide.md
index b4aecb8d958..c1d216c1607 100644
--- a/website/docs/python-rust-quick-start-guide.md
+++ b/website/docs/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
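The (t1, t2] range semantics of `read_incremental_records` can be illustrated with plain timestamp comparisons; a minimal Python sketch over hypothetical `(commit_time, record)` pairs (the helper `incremental_records` is illustrative, not part of the hudi package):

```python
def incremental_records(records, t1, t2=None):
    """Select records with commit time in (t1, t2]; t2=None means no upper bound.

    Hudi Timeline timestamps are fixed-width digit strings, so lexicographic
    comparison matches chronological order.
    """
    return [r for ts, r in records if ts > t1 and (t2 is None or ts <= t2)]

commits = [
    ("20240101000000000", "row-a"),
    ("20240201000000000", "row-b"),
    ("20240301000000000", "row-c"),
]
print(incremental_records(commits, "20240101000000000", "20240201000000000"))  # ['row-b']
print(incremental_records(commits, "20240101000000000"))  # ['row-b', 'row-c']
```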
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
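A timestamp in this format can be produced from a Python `datetime`; a small sketch (the helper name `to_timeline_timestamp` is ours, not part of the hudi package):

```python
from datetime import datetime, timezone

def to_timeline_timestamp(dt: datetime, millis: bool = True) -> str:
    """Format a datetime as a Hudi Timeline timestamp:
    yyyyMMddHHmmssSSS (17 chars) or yyyyMMddHHmmss (14 chars)."""
    base = dt.strftime("%Y%m%d%H%M%S")
    if millis:
        return base + f"{dt.microsecond // 1000:03d}"
    return base

ts = to_timeline_timestamp(datetime(2024, 12, 31, 12, 34, 56, 789000, tzinfo=timezone.utc))
print(ts)  # 20241231123456789
```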
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
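To illustrate what "file slices in splits" means for a planner, here is a hedged, engine-agnostic sketch that distributes file slices round-robin into N splits for parallel scan tasks; `split_file_slices` and the string stand-ins are ours, not the hudi-rs API:

```python
from typing import List, Sequence

def split_file_slices(file_slices: Sequence[str], num_splits: int) -> List[List[str]]:
    """Round-robin file slices into at most num_splits groups, one per scan task."""
    splits: List[List[str]] = [[] for _ in range(num_splits)]
    for i, fs in enumerate(file_slices):
        splits[i % num_splits].append(fs)
    return [s for s in splits if s]  # drop empty splits

splits = split_file_slices([f"file-slice-{i}" for i in range(5)], 2)
print(splits)  # [['file-slice-0', 'file-slice-2', 'file-slice-4'], ['file-slice-1', 'file-slice-3']]
```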
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
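For instance, with an `s3://` base URI the credentials could be exported before building the table; a sketch with placeholder values (the variable names follow the common `AWS_*` convention):

```python
import os

# Placeholder credentials; hudi-rs picks up relevant AWS_* / AZURE_* / GOOGLE_*
# variables from the environment based on the table base URI's scheme.
os.environ["AWS_ACCESS_KEY_ID"] = "my-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "my-secret-access-key"
os.environ["AWS_REGION"] = "us-west-2"

print(os.environ["AWS_REGION"])  # us-west-2
```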
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.
diff --git a/website/src/pages/ecosystem.md b/website/src/pages/ecosystem.md
index 52857120b26..563beeb4a31 100644
--- a/website/src/pages/ecosystem.md
+++ b/website/src/pages/ecosystem.md
@@ -37,5 +37,5 @@ In such cases, you can leverage another tool like Apache
Spark or Apache Flink t
 | Apache Doris | [Read](https://doris.apache.org/docs/ecosystem/external-table/hudi-external-table/) | |
 | Starrocks | [Read](https://docs.starrocks.io/docs/data_source/catalog/hudi_catalog/) | [Demo with HMS + Min.IO](https://github.com/StarRocks/demo/tree/master/documentation-samples/hudi) |
 | Dremio | | |
-| Daft | [Read](https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/hudi.html) | |
+| Daft | [Read](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/) | |
 | Ray Data | [Read](https://docs.ray.io/en/master/data/api/input_output.html#hudi) | |
diff --git
a/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
b/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
index 73f22a1c673..7955904d807 100644
--- a/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
+++ b/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.
diff --git
a/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
b/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
index b4aecb8d958..c1d216c1607 100644
--- a/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
+++ b/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.