paleolimbot commented on code in PR #417: URL: https://github.com/apache/sedona-db/pull/417#discussion_r2599162661
########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. + +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. + + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists Review Comment: For the purposes of rendering this notebook nicely, you could use `import tempfile` + `tempfile.TemporaryDirectory()` or `pathlib.Path("/tmp/warehouse").mkdir(exist_ok=True)`.
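A minimal sketch of the two options mentioned in this comment, using only the standard library. The variable names (`fixed_warehouse`, `scratch_dir`, `scratch_warehouse`) are illustrative and not part of the notebook, and `tempfile.gettempdir()` is used here in place of a hard-coded `/tmp` as an assumption for portability.

```python
import tempfile
from pathlib import Path

# Option 1: keep a fixed location, but make creation idempotent so that
# re-running the cell does not fail when the directory already exists.
fixed_warehouse = Path(tempfile.gettempdir()) / "warehouse"
fixed_warehouse.mkdir(parents=True, exist_ok=True)

# Option 2: use a fresh scratch directory. It is removed when
# scratch_dir.cleanup() is called (or when the object is finalized),
# so keep a reference to it for as long as the catalog is in use.
scratch_dir = tempfile.TemporaryDirectory()
scratch_warehouse = Path(scratch_dir.name)

print(fixed_warehouse.is_dir(), scratch_warehouse.is_dir())
```

Either path could stand in for the notebook's `warehouse_path`; with the second option, the catalog `uri` and `warehouse` strings would have to be built from `scratch_warehouse` rather than from the literal `/tmp/warehouse`.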
########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. + +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. 
+ + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists + + +Now set up the catalog: + + +```python +warehouse_path = "/tmp/warehouse" +catalog = load_catalog( + "default", + **{ + 'type': 'sql', + "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", + "warehouse": f"file://{warehouse_path}", + }, +) +``` + +Use SedonaDB to read a Parquet file containing country data into a DataFrame. + + +```python +sd = sedona.db.connect() + +countries = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet" +) +``` + +Convert all the columns to be plain strings because Iceberg doesn’t support geometry columns yet: + + +```python +countries.to_view("countries", True) +df = sd.sql(""" + select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(continent, 'Utf8') as continent, + ST_AsText(geometry) as geometry_wkt + from countries +""") +``` + +Check out the schema of the DataFrame: + + +```python +df.schema +``` + + + + + SedonaSchema with 3 fields: + name: utf8<Utf8> + continent: utf8<Utf8> + geometry_wkt: utf8<Utf8> + + + +Now create a new Iceberg table: + + +```python +from pyiceberg.exceptions import NamespaceAlreadyExistsError +try: + catalog.create_namespace("default") +except NamespaceAlreadyExistsError: + pass + +if catalog.table_exists("default.countries"): + catalog.drop_table("default.countries") + +table = catalog.create_table( + "default.countries", + schema=df.to_arrow_table().schema, Review Comment: ```suggestion schema=pa.schema(df.schema) ``` (avoids materializing the entire table) ########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. 
See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. + +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. + + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists + + +Now set up the catalog: + + +```python +warehouse_path = "/tmp/warehouse" +catalog = load_catalog( + "default", + **{ + 'type': 'sql', + "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", + "warehouse": f"file://{warehouse_path}", + }, +) +``` + +Use SedonaDB to read a Parquet file containing country data into a DataFrame. 
+ + +```python +sd = sedona.db.connect() + +countries = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet" +) +``` + +Convert all the columns to be plain strings because Iceberg doesn’t support geometry columns yet: + + +```python +countries.to_view("countries", True) +df = sd.sql(""" + select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(continent, 'Utf8') as continent, + ST_AsText(geometry) as geometry_wkt + from countries +""") +``` + +Check out the schema of the DataFrame: + + +```python +df.schema +``` + + + + + SedonaSchema with 3 fields: + name: utf8<Utf8> + continent: utf8<Utf8> + geometry_wkt: utf8<Utf8> + + + +Now create a new Iceberg table: + + +```python +from pyiceberg.exceptions import NamespaceAlreadyExistsError +try: + catalog.create_namespace("default") +except NamespaceAlreadyExistsError: + pass + +if catalog.table_exists("default.countries"): + catalog.drop_table("default.countries") + +table = catalog.create_table( + "default.countries", + schema=df.to_arrow_table().schema, +) +``` + +Append the DataFrame to the table: + + +```python +table.append(df.to_arrow_table()) +``` + +Now let’s see how to read the data with SedonaDB. + +## Read the Iceberg table into SedonaDB via Arrow + +Here’s how to read an Iceberg table into a SedonaDB DataFrame: + + +```python +table = catalog.load_table("default.countries") +arrow_table = table.scan().to_arrow() +df = sd.create_data_frame(arrow_table) +``` + +The Iceberg table is first exposed as an arrow table and then read into a SedonaDB DataFrame. 
+ +Now view the contents of the SedonaDB DataFrame: + + +```python +df.to_view("my_table", True) +res = sd.sql(""" +SELECT + name, + continent, + ST_GeomFromWKT(geometry_wkt) as geom +from my_table +""") +res.show(3) +``` + + ┌─────────────────────────────┬───────────┬────────────────────────────────────────────────────────┐ + │ name ┆ continent ┆ geom │ + │ utf8 ┆ utf8 ┆ geometry │ + ╞═════════════════════════════╪═══════════╪════════════════════════════════════════════════════════╡ + │ Fiji ┆ Oceania ┆ MULTIPOLYGON(((180 -16.067132663642447,180 -16.555216… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ United Republic of Tanzania ┆ Africa ┆ POLYGON((33.90371119710453 -0.9500000000000001,34.072… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ Western Sahara ┆ Africa ┆ POLYGON((-8.665589565454809 27.656425889592356,-8.665… │ + └─────────────────────────────┴───────────┴────────────────────────────────────────────────────────┘ + + +You can see that the geom column contains the geometry type, which enables spatial analysis of the data. +Future Iceberg geography/geometry work +Iceberg added support for geometry and geography columns in the v3 spec. + +The Iceberg v3 implementation has not been released yet, and it the v3 spec hasn't started in Iceberg Rust. Here is the open issue to add geo support to Iceberg Rust. Review Comment: Are you missing a link here? ########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. + +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. + + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists + + +Now set up the catalog: + + +```python +warehouse_path = "/tmp/warehouse" +catalog = load_catalog( + "default", + **{ + 'type': 'sql', + "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", + "warehouse": f"file://{warehouse_path}", + }, +) +``` + +Use SedonaDB to read a Parquet file containing country data into a DataFrame. 
+ + +```python +sd = sedona.db.connect() + +countries = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet" +) +``` + +Convert all the columns to be plain strings because Iceberg doesn’t support geometry columns yet: + + +```python +countries.to_view("countries", True) +df = sd.sql(""" + select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(continent, 'Utf8') as continent, + ST_AsText(geometry) as geometry_wkt + from countries +""") +``` + +Check out the schema of the DataFrame: + + +```python +df.schema +``` + + + + + SedonaSchema with 3 fields: + name: utf8<Utf8> + continent: utf8<Utf8> + geometry_wkt: utf8<Utf8> + + + +Now create a new Iceberg table: + + +```python +from pyiceberg.exceptions import NamespaceAlreadyExistsError +try: + catalog.create_namespace("default") +except NamespaceAlreadyExistsError: + pass + +if catalog.table_exists("default.countries"): + catalog.drop_table("default.countries") + +table = catalog.create_table( + "default.countries", + schema=df.to_arrow_table().schema, +) +``` + +Append the DataFrame to the table: + + +```python +table.append(df.to_arrow_table()) +``` + +Now let’s see how to read the data with SedonaDB. + +## Read the Iceberg table into SedonaDB via Arrow + +Here’s how to read an Iceberg table into a SedonaDB DataFrame: + + +```python +table = catalog.load_table("default.countries") +arrow_table = table.scan().to_arrow() +df = sd.create_data_frame(arrow_table) +``` + +The Iceberg table is first exposed as an arrow table and then read into a SedonaDB DataFrame. 
+ +Now view the contents of the SedonaDB DataFrame: + + +```python +df.to_view("my_table", True) +res = sd.sql(""" +SELECT + name, + continent, + ST_GeomFromWKT(geometry_wkt) as geom +from my_table +""") +res.show(3) +``` + + ┌─────────────────────────────┬───────────┬────────────────────────────────────────────────────────┐ + │ name ┆ continent ┆ geom │ + │ utf8 ┆ utf8 ┆ geometry │ + ╞═════════════════════════════╪═══════════╪════════════════════════════════════════════════════════╡ + │ Fiji ┆ Oceania ┆ MULTIPOLYGON(((180 -16.067132663642447,180 -16.555216… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ United Republic of Tanzania ┆ Africa ┆ POLYGON((33.90371119710453 -0.9500000000000001,34.072… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ Western Sahara ┆ Africa ┆ POLYGON((-8.665589565454809 27.656425889592356,-8.665… │ + └─────────────────────────────┴───────────┴────────────────────────────────────────────────────────┘ + + +You can see that the geom column contains the geometry type, which enables spatial analysis of the data. +Future Iceberg geography/geometry work +Iceberg added support for geometry and geography columns in the v3 spec. + +The Iceberg v3 implementation has not been released yet, and it the v3 spec hasn't started in Iceberg Rust. Here is the open issue to add geo support to Iceberg Rust. + +It’s best to manually persist the bbox information of files in your Iceberg table if you’re storing geometric data as WKT or WKB. + +## Create an Iceberg table with WKB and bbox + +Let’s see how to create an Iceberg table with a WKB and bbox columns to allow for faster spatial analyses. + +Start by creating the cities DataFrame. 
+ + +```python +cities = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities_geo.parquet" +) +cities.to_view("cities", True) +``` + +Now write the DataFrame to an Iceberg table with bbox columns: + + +```python +df = sd.sql(""" +select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(ST_AsBinary(geometry), 'Binary') as geometry_wkb, Review Comment: It might be worth a comment here to note that this is required because PyIceberg doesn't support string views (if it does, you should be able to remove this) ########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. + +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. 
+ + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists + + +Now set up the catalog: + + +```python +warehouse_path = "/tmp/warehouse" +catalog = load_catalog( + "default", + **{ + 'type': 'sql', + "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", + "warehouse": f"file://{warehouse_path}", + }, +) +``` + +Use SedonaDB to read a Parquet file containing country data into a DataFrame. + + +```python +sd = sedona.db.connect() + +countries = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet" +) +``` + +Convert all the columns to be plain strings because Iceberg doesn’t support geometry columns yet: + + +```python +countries.to_view("countries", True) +df = sd.sql(""" + select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(continent, 'Utf8') as continent, + ST_AsText(geometry) as geometry_wkt + from countries +""") +``` + +Check out the schema of the DataFrame: + + +```python +df.schema +``` + + + + + SedonaSchema with 3 fields: + name: utf8<Utf8> + continent: utf8<Utf8> + geometry_wkt: utf8<Utf8> + + + +Now create a new Iceberg table: + + +```python +from pyiceberg.exceptions import NamespaceAlreadyExistsError +try: + catalog.create_namespace("default") +except NamespaceAlreadyExistsError: + pass + +if catalog.table_exists("default.countries"): + catalog.drop_table("default.countries") + +table = catalog.create_table( + "default.countries", + schema=df.to_arrow_table().schema, +) +``` + +Append the DataFrame to the table: + + +```python +table.append(df.to_arrow_table()) +``` + +Now let’s see how to read the data with SedonaDB. 
+ +## Read the Iceberg table into SedonaDB via Arrow + +Here’s how to read an Iceberg table into a SedonaDB DataFrame: + + +```python +table = catalog.load_table("default.countries") +arrow_table = table.scan().to_arrow() +df = sd.create_data_frame(arrow_table) +``` + +The Iceberg table is first exposed as an arrow table and then read into a SedonaDB DataFrame. + +Now view the contents of the SedonaDB DataFrame: + + +```python +df.to_view("my_table", True) +res = sd.sql(""" +SELECT + name, + continent, + ST_GeomFromWKT(geometry_wkt) as geom +from my_table +""") +res.show(3) +``` + + ┌─────────────────────────────┬───────────┬────────────────────────────────────────────────────────┐ + │ name ┆ continent ┆ geom │ + │ utf8 ┆ utf8 ┆ geometry │ + ╞═════════════════════════════╪═══════════╪════════════════════════════════════════════════════════╡ + │ Fiji ┆ Oceania ┆ MULTIPOLYGON(((180 -16.067132663642447,180 -16.555216… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ United Republic of Tanzania ┆ Africa ┆ POLYGON((33.90371119710453 -0.9500000000000001,34.072… │ + ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤ + │ Western Sahara ┆ Africa ┆ POLYGON((-8.665589565454809 27.656425889592356,-8.665… │ + └─────────────────────────────┴───────────┴────────────────────────────────────────────────────────┘ + + +You can see that the geom column contains the geometry type, which enables spatial analysis of the data. +Future Iceberg geography/geometry work +Iceberg added support for geometry and geography columns in the v3 spec. + +The Iceberg v3 implementation has not been released yet, and it the v3 spec hasn't started in Iceberg Rust. Here is the open issue to add geo support to Iceberg Rust. + +It’s best to manually persist the bbox information of files in your Iceberg table if you’re storing geometric data as WKT or WKB. 
+ +## Create an Iceberg table with WKB and bbox + +Let’s see how to create an Iceberg table with a WKB and bbox columns to allow for faster spatial analyses. + +Start by creating the cities DataFrame. + + +```python +cities = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_cities_geo.parquet" +) +cities.to_view("cities", True) +``` + +Now write the DataFrame to an Iceberg table with bbox columns: + + +```python +df = sd.sql(""" +select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(ST_AsBinary(geometry), 'Binary') as geometry_wkb, + ST_XMin(geometry) as bbox_xmin, + ST_YMin(geometry) as bbox_ymin, + ST_XMax(geometry) as bbox_xmax, + ST_YMax(geometry) as bbox_ymax +from cities +""") +``` + + +```python +if catalog.table_exists("default.cities"): + catalog.drop_table("default.cities") + +table = catalog.create_table( + "default.cities", + schema=df.to_arrow_table().schema, Review Comment: ```suggestion schema=pa.schema(df.schema) ``` ########## docs/iceberg.md: ########## @@ -0,0 +1,326 @@ +<!--- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +# SedonaDB + Iceberg + +This page demonstrates how to store spatial data in Iceberg tables and how to query them with SedonaDB. 
+ +You will learn how to create an Iceberg table with a well-known text (WKT) or well-known binary (WKB) column in an Iceberg table and some of the advantages of storing geometric data in Iceberg. + +Make sure to run `pip install pyiceberg` to install the required dependencies for this notebook. + +Let’s start by loading the required dependencies and saving a spatial dataset in an Iceberg table. + + +```python +from pyiceberg.catalog import load_catalog +import pyarrow.compute as pc +import sedona.db +``` + +## Create an Iceberg table with geometric data + +Start by creating the Iceberg warehouse: + + +```python +!mkdir /tmp/warehouse +``` + + mkdir: /tmp/warehouse: File exists + + +Now set up the catalog: + + +```python +warehouse_path = "/tmp/warehouse" +catalog = load_catalog( + "default", + **{ + 'type': 'sql', + "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", + "warehouse": f"file://{warehouse_path}", + }, +) +``` + +Use SedonaDB to read a Parquet file containing country data into a DataFrame. 
+ + +```python +sd = sedona.db.connect() + +countries = sd.read_parquet( + "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/natural-earth/files/natural-earth_countries_geo.parquet" +) +``` + +Convert all the columns to be plain strings because Iceberg doesn’t support geometry columns yet: + + +```python +countries.to_view("countries", True) +df = sd.sql(""" + select + ARROW_CAST(name, 'Utf8') as name, + ARROW_CAST(continent, 'Utf8') as continent, + ST_AsText(geometry) as geometry_wkt + from countries +""") +``` + +Check out the schema of the DataFrame: + + +```python +df.schema +``` + + + + + SedonaSchema with 3 fields: + name: utf8<Utf8> + continent: utf8<Utf8> + geometry_wkt: utf8<Utf8> + + + +Now create a new Iceberg table: + + +```python +from pyiceberg.exceptions import NamespaceAlreadyExistsError +try: + catalog.create_namespace("default") +except NamespaceAlreadyExistsError: + pass + +if catalog.table_exists("default.countries"): + catalog.drop_table("default.countries") + +table = catalog.create_table( + "default.countries", + schema=df.to_arrow_table().schema, +) +``` + +Append the DataFrame to the table: + + +```python +table.append(df.to_arrow_table()) +``` + +Now let’s see how to read the data with SedonaDB. + +## Read the Iceberg table into SedonaDB via Arrow + +Here’s how to read an Iceberg table into a SedonaDB DataFrame: + + +```python +table = catalog.load_table("default.countries") +arrow_table = table.scan().to_arrow() +df = sd.create_data_frame(arrow_table) +``` + +The Iceberg table is first exposed as an arrow table and then read into a SedonaDB DataFrame. 
+ +Now view the contents of the SedonaDB DataFrame: + + +```python +df.to_view("my_table", True) +res = sd.sql(""" +SELECT + name, + continent, + ST_GeomFromWKT(geometry_wkt) as geom Review Comment: I had this comment on the delta tutorial as well, but pushing folks to WKB instead of WKT might lead to more success (it's much faster) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
