This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new db6701632fb [DOCS] Update python/rust page and links (#12788)
db6701632fb is described below
commit db6701632fbac4ce428d25de4f2b5c01f94a4b45
Author: Shiyan Xu <[email protected]>
AuthorDate: Wed Feb 5 19:12:05 2025 -0600
[DOCS] Update python/rust page and links (#12788)
Update python/rust quick start page to be in sync with hudi-rs readme. And
fix daft hudi page links.
---
website/docs/python-rust-quick-start-guide.md | 210 +++++++++++++++++----
website/src/pages/ecosystem.md | 2 +-
.../python-rust-quick-start-guide.md | 210 +++++++++++++++++----
.../version-1.0.0/python-rust-quick-start-guide.md | 210 +++++++++++++++++----
4 files changed, 508 insertions(+), 124 deletions(-)
diff --git a/website/docs/python-rust-quick-start-guide.md
b/website/docs/python-rust-quick-start-guide.md
index b4aecb8d958..c1d216c1607 100644
--- a/website/docs/python-rust-quick-start-guide.md
+++ b/website/docs/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
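The (t1, t2] range semantics of `read_incremental_records` can be illustrated with plain timestamp comparisons; a minimal Python sketch over hypothetical `(commit_time, record)` pairs (the helper `incremental_records` is illustrative, not part of the hudi package):

```python
def incremental_records(records, t1, t2=None):
    """Select records with commit time in (t1, t2]; t2=None means no upper bound.

    Hudi Timeline timestamps are fixed-width digit strings, so lexicographic
    comparison matches chronological order.
    """
    return [r for ts, r in records if ts > t1 and (t2 is None or ts <= t2)]

commits = [
    ("20240101000000000", "row-a"),
    ("20240201000000000", "row-b"),
    ("20240301000000000", "row-c"),
]
print(incremental_records(commits, "20240101000000000", "20240201000000000"))  # ['row-b']
print(incremental_records(commits, "20240101000000000"))  # ['row-b', 'row-c']
```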
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
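A timestamp in this format can be produced from a Python `datetime`; a small sketch (the helper name `to_timeline_timestamp` is ours, not part of the hudi package):

```python
from datetime import datetime, timezone

def to_timeline_timestamp(dt: datetime, millis: bool = True) -> str:
    """Format a datetime as a Hudi Timeline timestamp:
    yyyyMMddHHmmssSSS (17 chars) or yyyyMMddHHmmss (14 chars)."""
    base = dt.strftime("%Y%m%d%H%M%S")
    if millis:
        return base + f"{dt.microsecond // 1000:03d}"
    return base

ts = to_timeline_timestamp(datetime(2024, 12, 31, 12, 34, 56, 789000, tzinfo=timezone.utc))
print(ts)  # 20241231123456789
```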
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
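To illustrate what "file slices in splits" means for a planner, here is a hedged, engine-agnostic sketch that distributes file slices round-robin into N splits for parallel scan tasks; `split_file_slices` and the string stand-ins are ours, not the hudi-rs API:

```python
from typing import List, Sequence

def split_file_slices(file_slices: Sequence[str], num_splits: int) -> List[List[str]]:
    """Round-robin file slices into at most num_splits groups, one per scan task."""
    splits: List[List[str]] = [[] for _ in range(num_splits)]
    for i, fs in enumerate(file_slices):
        splits[i % num_splits].append(fs)
    return [s for s in splits if s]  # drop empty splits

splits = split_file_slices([f"file-slice-{i}" for i in range(5)], 2)
print(splits)  # [['file-slice-0', 'file-slice-2', 'file-slice-4'], ['file-slice-1', 'file-slice-3']]
```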
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
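For instance, with an `s3://` base URI the credentials could be exported before building the table; a sketch with placeholder values (the variable names follow the common `AWS_*` convention):

```python
import os

# Placeholder credentials; hudi-rs picks up relevant AWS_* / AZURE_* / GOOGLE_*
# variables from the environment based on the table base URI's scheme.
os.environ["AWS_ACCESS_KEY_ID"] = "my-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "my-secret-access-key"
os.environ["AWS_REGION"] = "us-west-2"

print(os.environ["AWS_REGION"])  # us-west-2
```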
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.
diff --git a/website/src/pages/ecosystem.md b/website/src/pages/ecosystem.md
index 52857120b26..563beeb4a31 100644
--- a/website/src/pages/ecosystem.md
+++ b/website/src/pages/ecosystem.md
@@ -37,5 +37,5 @@ In such cases, you can leverage another tool like Apache
Spark or Apache Flink t
 | Apache Doris | [Read](https://doris.apache.org/docs/ecosystem/external-table/hudi-external-table/) | |
 | Starrocks | [Read](https://docs.starrocks.io/docs/data_source/catalog/hudi_catalog/) | [Demo with HMS + Min.IO](https://github.com/StarRocks/demo/tree/master/documentation-samples/hudi) |
 | Dremio | | |
-| Daft | [Read](https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/hudi.html) | |
+| Daft | [Read](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/) | |
 | Ray Data | [Read](https://docs.ray.io/en/master/data/api/input_output.html#hudi) | |
diff --git
a/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
b/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
index 73f22a1c673..7955904d807 100644
--- a/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
+++ b/website/versioned_docs/version-0.15.0/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.
diff --git
a/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
b/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
index b4aecb8d958..c1d216c1607 100644
--- a/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
+++ b/website/versioned_docs/version-1.0.0/python-rust-quick-start-guide.md
@@ -6,7 +6,7 @@ last_modified_at: 2024-11-28T12:53:57+08:00
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-This guide will help you get started with [hudi-rs](https://github.com/apache/hudi-rs), a native Rust library for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
+This guide will help you get started with [Hudi-rs](https://github.com/apache/hudi-rs), the native Rust implementation for Apache Hudi with Python bindings. Learn how to install, set up, and perform basic operations using both Python and Rust interfaces.
## Installation
@@ -18,48 +18,172 @@ pip install hudi
cargo add hudi
```
-## Basic Usage
+## Usage Examples
-:::note
-Currently, write capabilities and reading from MOR tables are not supported.
+> [!NOTE]
+> These examples expect a Hudi table to exist at `/tmp/trips_table`, created
+> using the [quick start guide](/docs/quick-start-guide).
-The examples below expect a Hudi table exists at `/tmp/trips_table`, created
using the [quick start guide](/docs/quick-start-guide).
-:::
+### Snapshot Query
-### Python Example
+A snapshot query reads the latest version of the data from the table. The table
+API also accepts partition filters.
+
+#### Python
```python
from hudi import HudiTableBuilder
import pyarrow as pa
+hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
+batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+
+# convert to PyArrow table
+arrow_table = pa.Table.from_batches(batches)
+result = arrow_table.select(["rider", "city", "ts", "fare"])
+print(result)
+```
+
+#### Rust
+
+```rust
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
+use arrow::compute::concat_batches;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
+    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
+ let batch = concat_batches(&batches[0].schema(), &batches)?;
+ let columns = vec!["rider", "city", "ts", "fare"];
+ for col_name in columns {
+ let idx = batch.schema().index_of(col_name).unwrap();
+ println!("{}: {}", col_name, batch.column(idx));
+ }
+ Ok(())
+}
+```
+
+To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set
+`hoodie.read.use.read_optimized.mode` when building the table instance.
+
+#### Python
+
+```python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
+```
+
+#### Rust
+
+```rust
+let hudi_table =
+ HudiTableBuilder::from_base_uri("/tmp/trips_table")
+ .with_option("hoodie.read.use.read_optimized.mode", "true")
+ .build().await?;
+```
+
+> [!NOTE]
+> Currently reading MOR tables is limited to tables with Parquet data blocks.
+
+### Time-Travel Query
+
+A time-travel query reads the data at a specific timestamp from the table. The
+table API also accepts partition filters.
+
+#### Python
+
+```python
+batches = (
+ hudi_table
+    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
+)
+```
-# Read with partition filters
-records = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
+#### Rust
-# Convert to PyArrow table
-arrow_table = pa.Table.from_batches(records)
-result = arrow_table.select(["rider", "city", "ts", "fare"])
+```rust
+let batches =
+ hudi_table
+        .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
```
-### Rust Example (with DataFusion)
+### Incremental Query
-1. Set up your project:
+An incremental query reads the changed data from the table for a given time range.
-```bash
+#### Python
+
+```python
+# read the records between t1 (exclusive) and t2 (inclusive)
+batches = hudi_table.read_incremental_records(t1, t2)
+
+# read the records after t1
+batches = hudi_table.read_incremental_records(t1)
+```
+
+#### Rust
+
+```rust
+// read the records between t1 (exclusive) and t2 (inclusive)
+let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
+
+// read the records after t1
+let batches = hudi_table.read_incremental_records(t1, None).await?;
+```
+
+> [!NOTE]
+> Currently, the only supported format for the timestamp arguments is the Hudi
+> Timeline format: `yyyyMMddHHmmssSSS` or `yyyyMMddHHmmss`.
+
+## Query Engine Integration
+
+Hudi-rs provides APIs to support integration with query engines. The sections
below highlight some commonly used APIs.
+
+### Table API
+
+Create a Hudi table instance using its constructor or the `TableBuilder` API.
+
+| Stage           | API                                       | Description                                                                    |
+|-----------------|-------------------------------------------|--------------------------------------------------------------------------------|
+| Query planning  | `get_file_slices()`                       | For snapshot query, get a list of file slices.                                 |
+|                 | `get_file_slices_splits()`                | For snapshot query, get a list of file slices in splits.                       |
+|                 | `get_file_slices_as_of()`                 | For time-travel query, get a list of file slices at a given time.              |
+|                 | `get_file_slices_splits_as_of()`          | For time-travel query, get a list of file slices in splits at a given time.    |
+|                 | `get_file_slices_between()`               | For incremental query, get a list of changed file slices within a time range.  |
+| Query execution | `create_file_group_reader_with_options()` | Create a file group reader instance with the table instance's configs.         |
+
+### File Group API
+
+Create a Hudi file group reader instance using its constructor or the Hudi
+table API `create_file_group_reader_with_options()`.
+
+| Stage           | API                 | Description                                                                                                                                                 |
+|-----------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Query execution | `read_file_slice()` | Read records from a given file slice; based on the configs, read from only the base file or from the base file and log files, merging records with the configured strategy. |
+
+
+### Apache DataFusion
+
+Enabling the `datafusion` feature of the `hudi` crate provides a
+[DataFusion](https://datafusion.apache.org/) extension for querying Hudi tables.
+
+<details>
+<summary>Add the `hudi` crate with the `datafusion` feature to your application
+to query a Hudi table.</summary>
+
+```shell
cargo new my_project --bin && cd my_project
-cargo add tokio@1 datafusion@42
+cargo add tokio@1 datafusion@43
cargo add hudi --features datafusion
```
-1. Add code to `src/main.rs`:
+Update `src/main.rs` with the code snippet below, then run `cargo run`.
+
+</details>
```rust
use std::sync::Arc;
+
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
@@ -67,18 +191,32 @@ use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
- let hudi = HudiDataSource::new_with_options("/tmp/trips_table", []).await?;
+ let hudi = HudiDataSource::new_with_options(
+ "/tmp/trips_table",
+ [("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
- // Read with partition filters
     let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
```
-## Cloud Storage Integration
+### Other Integrations
+
+Hudi is also integrated with
+
+- [Daft](https://www.getdaft.io/projects/docs/en/stable/integrations/hudi/)
+-
[Ray](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_hudi.html#ray.data.read_hudi)
+
+### Working with Cloud Storage
+
+Ensure cloud storage credentials are set properly as environment variables,
+e.g., `AWS_*`, `AZURE_*`, or `GOOGLE_*`. The relevant storage environment
+variables will then be picked up, and a base URI with a scheme such as
+`s3://`, `az://`, or `gs://` will be processed accordingly.
+
+Alternatively, you can pass the storage configuration as options via the table APIs.
-### Python
+#### Python
```python
from hudi import HudiTableBuilder
@@ -91,29 +229,19 @@ hudi_table = (
)
```
-### Rust
+#### Rust
```rust
-use hudi::HudiDataSource;
+use hudi::error::Result;
+use hudi::table::builder::TableBuilder as HudiTableBuilder;
-let hudi = HudiDataSource::new_with_options(
- "s3://bucket/trips_table",
- [("aws_region", "us-west-2")]
-).await?;
+#[tokio::main]
+async fn main() -> Result<()> {
+    let hudi_table = HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
+        .with_option("aws_region", "us-west-2")
+        .build().await?;
+    Ok(())
+}
```
-### Supported Cloud Storage
-
-- AWS S3 (`s3://`)
-- Azure Storage (`az://`)
-- Google Cloud Storage (`gs://`)
-
-Set appropriate environment variables (`AWS_*`, `AZURE_*`, or `GOOGLE_*`) for
authentication, or pass through the `option()` API.
+## Contributing
-## Read with Timestamp
-
-Add timestamp option for time-travel queries:
-
-```python
-.with_option("hoodie.read.as.of.timestamp", "20241122010827898")
-```
+Check out the [contributing guide](https://github.com/apache/hudi-rs/blob/main/CONTRIBUTING.md)
+for details on making contributions to the project.