This is an automated email from the ASF dual-hosted git repository.
honahx pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-python.git
The following commit(s) were added to refs/heads/main by this push:
new 2836c4a Small getting started guide on writes (#311)
2836c4a is described below
commit 2836c4a85708d4e00fe1fa35f880728dfe074b60
Author: Fokko Driesprong <[email protected]>
AuthorDate: Tue Jan 30 08:51:30 2024 +0100
Small getting started guide on writes (#311)
---
mkdocs/docs/SUMMARY.md | 2 +-
mkdocs/docs/contributing.md | 16 +++++
mkdocs/docs/index.md | 144 +++++++++++++++++++++++++++++++++++---------
pyiceberg/table/__init__.py | 6 +-
tests/test_schema.py | 21 +++++++
5 files changed, 159 insertions(+), 30 deletions(-)
diff --git a/mkdocs/docs/SUMMARY.md b/mkdocs/docs/SUMMARY.md
index 5504d2a..40ba0bf 100644
--- a/mkdocs/docs/SUMMARY.md
+++ b/mkdocs/docs/SUMMARY.md
@@ -17,7 +17,7 @@
<!-- prettier-ignore-start -->
-- [Home](index.md)
+- [Getting started](index.md)
- [Configuration](configuration.md)
- [CLI](cli.md)
- [API](api.md)
diff --git a/mkdocs/docs/contributing.md b/mkdocs/docs/contributing.md
index 8ec6dcb..7411382 100644
--- a/mkdocs/docs/contributing.md
+++ b/mkdocs/docs/contributing.md
@@ -58,6 +58,22 @@ For IDEA ≤2021 you need to install the [Poetry integration
as a plugin](https:
Now you're set using Poetry, and all the tests will run in Poetry, and you'll
have syntax highlighting in the pyproject.toml to indicate stale dependencies.
+## Installation from source
+
+Clone the repository for local development:
+
+```sh
+git clone https://github.com/apache/iceberg-python.git
+cd iceberg-python
+pip3 install -e ".[s3fs,hive]"
+```
+
+Install it directly for GitHub (not recommended), but sometimes handy:
+
+```
+pip install
"git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+```
+
## Linting
`pre-commit` is used for autoformatting and linting:
diff --git a/mkdocs/docs/index.md b/mkdocs/docs/index.md
index 628f4f7..d82aa65 100644
--- a/mkdocs/docs/index.md
+++ b/mkdocs/docs/index.md
@@ -20,11 +20,11 @@ hide:
- limitations under the License.
-->
-# PyIceberg
+# Getting started with PyIceberg
PyIceberg is a Python implementation for accessing Iceberg tables, without the
need of a JVM.
-## Install
+## Installation
Before installing PyIceberg, make sure that you're on an up-to-date version of
`pip`:
@@ -38,36 +38,126 @@ You can install the latest release version from pypi:
pip install "pyiceberg[s3fs,hive]"
```
-Install it directly for GitHub (not recommended), but sometimes handy:
+You can mix and match optional dependencies depending on your needs:
+
+| Key | Description:
|
+| ------------ |
-------------------------------------------------------------------- |
+| hive | Support for the Hive metastore
|
+| glue | Support for AWS Glue
|
+| dynamodb | Support for AWS DynamoDB
|
+| sql-postgres | Support for SQL Catalog backed by Postgresql
|
+| sql-sqlite | Support for SQL Catalog backed by SQLite
|
+| pyarrow | PyArrow as a FileIO implementation to interact with the
object store |
+| pandas | Installs both PyArrow and Pandas
|
+| duckdb | Installs both PyArrow and DuckDB
|
+| ray | Installs PyArrow, Pandas, and Ray
|
+| s3fs | S3FS as a FileIO implementation to interact with the object
store |
+| adlfs | ADLFS as a FileIO implementation to interact with the object
store |
+| snappy | Support for snappy Avro compression
|
+| gcs | GCS as the FileIO implementation to interact with the object
store |
+
+You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to
fetch files from an object store.
+
+## Connecting to a catalog
+
+Iceberg leverages the [catalog to have one centralized place to organize the
tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive
catalog to store your Iceberg tables next to the rest, a vendor solution like
the AWS Glue catalog, or an implementation of Icebergs' own [REST
protocol](https://github.com/apache/iceberg/tree/main/open-api). Checkout the
[configuration](configuration.md) page to find all the configuration details.
+
+## Write a PyArrow dataframe
+
+Let's take the Taxi dataset, and write this to an Iceberg table.
+
+First download one month of data:
+
+```shell
+curl
https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
-o /tmp/yellow_tripdata_2023-01.parquet
+```
+
+Load it into your PyArrow dataframe:
+
+```python
+import pyarrow.parquet as pq
+df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
```
-pip install
"git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+
+Create a new Iceberg table:
+
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog("default")
+
+table = catalog.create_table(
+ "default.taxi_dataset",
+ schema=df.schema,
+)
```
-Or clone the repository for local development:
+Append the dataframe to the table:
-```sh
-git clone https://github.com/apache/iceberg-python.git
-cd iceberg-python
-pip3 install -e ".[s3fs,hive]"
+```python
+table.append(df)
+len(table.scan().to_arrow())
```
-You can mix and match optional dependencies depending on your needs:
+3066766 rows have been written to the table.
+
+Now generate a tip-per-mile feature to train the model on:
+
+```python
+import pyarrow.compute as pc
+
+df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"],
df["trip_distance"]))
+```
+
+Evolve the schema of the table with the new column:
+
+```python
+with table.update_schema() as update_schema:
+ update_schema.union_by_name(df.schema)
+```
+
+And now we can write the new dataframe to the Iceberg table:
+
+```python
+table.overwrite(df)
+print(table.scan().to_arrow())
+```
+
+And the new column is there:
+
+```
+taxi_dataset(
+ 1: VendorID: optional long,
+ 2: tpep_pickup_datetime: optional timestamp,
+ 3: tpep_dropoff_datetime: optional timestamp,
+ 4: passenger_count: optional double,
+ 5: trip_distance: optional double,
+ 6: RatecodeID: optional double,
+ 7: store_and_fwd_flag: optional string,
+ 8: PULocationID: optional long,
+ 9: DOLocationID: optional long,
+ 10: payment_type: optional long,
+ 11: fare_amount: optional double,
+ 12: extra: optional double,
+ 13: mta_tax: optional double,
+ 14: tip_amount: optional double,
+ 15: tolls_amount: optional double,
+ 16: improvement_surcharge: optional double,
+ 17: total_amount: optional double,
+ 18: congestion_surcharge: optional double,
+ 19: airport_fee: optional double,
+ 20: tip_per_mile: optional double
+),
+```
+
+And we can see that 2371784 rows have a tip-per-mile:
+
+```python
+df = table.scan(row_filter="tip_per_mile > 0").to_arrow()
+len(df)
+```
+
+## More details
-| Key | Description:
|
-| -------- |
-------------------------------------------------------------------- |
-| hive | Support for the Hive metastore
|
-| glue | Support for AWS Glue
|
-| dynamodb | Support for AWS DynamoDB
|
-| pyarrow | PyArrow as a FileIO implementation to interact with the object
store |
-| pandas | Installs both PyArrow and Pandas
|
-| duckdb | Installs both PyArrow and DuckDB
|
-| ray | Installs PyArrow, Pandas, and Ray
|
-| s3fs | S3FS as a FileIO implementation to interact with the object store
|
-| adlfs | ADLFS as a FileIO implementation to interact with the object
store |
-| snappy | Support for snappy Avro compression
|
-| gcs | GCS as the FileIO implementation to interact with the object
store |
-
-You either need to install `s3fs`, `adlfs`, `gcs`, or `pyarrow` for fetching
files.
-
-There is both a [CLI](cli.md) and [Python API](api.md) available.
+For the details, please check the [CLI](cli.md) or [Python API](api.md) page.
diff --git a/pyiceberg/table/__init__.py b/pyiceberg/table/__init__.py
index 26eecef..16ed9ed 100644
--- a/pyiceberg/table/__init__.py
+++ b/pyiceberg/table/__init__.py
@@ -1477,9 +1477,11 @@ class UpdateSchema:
self._case_sensitive = case_sensitive
return self
- def union_by_name(self, new_schema: Schema) -> UpdateSchema:
+ def union_by_name(self, new_schema: Union[Schema, "pa.Schema"]) ->
UpdateSchema:
+ from pyiceberg.catalog import Catalog
+
visit_with_partner(
- new_schema,
+ Catalog._convert_schema_if_needed(new_schema),
-1,
UnionByNameVisitor(update_schema=self,
existing_schema=self._schema, case_sensitive=self._case_sensitive), # type:
ignore
PartnerIdByNameAccessor(partner_schema=self._schema,
case_sensitive=self._case_sensitive),
diff --git a/tests/test_schema.py b/tests/test_schema.py
index a5487b7..cfee6e7 100644
--- a/tests/test_schema.py
+++ b/tests/test_schema.py
@@ -18,6 +18,7 @@
from textwrap import dedent
from typing import Any, Dict, List
+import pyarrow as pa
import pytest
from pyiceberg import schema
@@ -1579,3 +1580,23 @@ def test_append_nested_lists() -> None:
)
assert union.as_struct() == expected.as_struct()
+
+
+def test_union_with_pa_schema(primitive_fields: NestedField) -> None:
+ base_schema = Schema(NestedField(field_id=1, name="foo",
field_type=StringType(), required=True))
+
+ pa_schema = pa.schema([
+ pa.field("foo", pa.string(), nullable=False),
+ pa.field("bar", pa.int32(), nullable=True),
+ pa.field("baz", pa.bool_(), nullable=True),
+ ])
+
+ new_schema = UpdateSchema(None,
schema=base_schema).union_by_name(pa_schema)._apply()
+
+ expected_schema = Schema(
+ NestedField(field_id=1, name="foo", field_type=StringType(),
required=True),
+ NestedField(field_id=2, name="bar", field_type=IntegerType(),
required=False),
+ NestedField(field_id=3, name="baz", field_type=BooleanType(),
required=False),
+ )
+
+ assert new_schema == expected_schema