This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch branch-1.7.0
in repository https://gitbox.apache.org/repos/asf/sedona.git

commit 53b5d410d180ebd7426c9f96d36e4c73c8f02184
Author: Matthew Powers <[email protected]>
AuthorDate: Thu Feb 27 16:44:37 2025 -0500

    [DOCS] add geopackage docs (#1835)
---
 docs/tutorial/files/geopackage-sedona-spark.md | 198 +++++++++++++++++++++++++
 docs/tutorial/sql.md                           |  64 +-------
 mkdocs.yml                                     |   1 +
 3 files changed, 201 insertions(+), 62 deletions(-)

diff --git a/docs/tutorial/files/geopackage-sedona-spark.md b/docs/tutorial/files/geopackage-sedona-spark.md
new file mode 100644
index 0000000000..aeeb94c5c0
--- /dev/null
+++ b/docs/tutorial/files/geopackage-sedona-spark.md
@@ -0,0 +1,198 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Apache Sedona GeoPackage with Spark
+
+This page shows how to read GeoPackage files with Apache Sedona and Spark.
+
+You’ll learn about the advantages and disadvantages of the GeoPackage file format and how to use them in production settings.
+
+Let’s start by creating a GeoPackage file and then reading it.
+
+## Reading a GeoPackage file with Sedona and Spark
+
+Let’s create a GeoPackage file with a few rows of data.
+
+Start by creating a GeoPandas DataFrame:
+
+```python
+point1 = Point(0, 0)
+point2 = Point(1, 1)
+polygon1 = Polygon([(5, 5), (6, 6), (7, 5), (6, 4)])
+
+data = {
+    "name": ["Point A", "Point B", "Polygon A"],
+    "value": [10, 20, 30],
+    "geometry": [point1, point2, polygon1],
+}
+gdf = gpd.GeoDataFrame(data, geometry="geometry")
+```
+
+Now write the GeoPandas DataFrame to a GeoPackage file:
+
+```python
+gdf.to_file("/tmp/my_file.gpkg", layer="my_layer", driver="GPKG")
+```
+
+GeoPandas knows to write this to a GeoPackage file because the code sets the driver to `GPKG`.
+
+You can think of the layer as the table name.
+
+Now let’s read the GeoPackage file with Apache Sedona and Spark:
+
+```python
+df = (
+    sedona.read.format("geopackage")
+    .option("tableName", "my_layer")
+    .load("/tmp/my_file.gpkg")
+)
+df.show()
+```
+
+Here are the contents of the DataFrame:
+
+```
++---+--------------------+---------+-----+
+|fid|                geom|     name|value|
++---+--------------------+---------+-----+
+|  1|         POINT (0 0)|  Point A|   10|
+|  2|         POINT (1 1)|  Point B|   20|
+|  3|POLYGON ((5 5, 6 ...|Polygon A|   30|
++---+--------------------+---------+-----+
+```
+
+The geometry column can contain many different geometric objects like points, polygons, and many more.
+
+You can also see the metadata of the GeoPackage file:
+
+```python
+df = (
+    sedona.read.format("geopackage")
+    .option("showMetadata", "true")
+    .load("/tmp/my_file.gpkg")
+)
+df.show()
+```
+
+Here are the contents:
+
+```
++----------+---------+----------+-----------+--------------------+-----+-----+-----+-----+------+
+|table_name|data_type|identifier|description|         last_change|min_x|min_y|max_x|max_y|srs_id|
++----------+---------+----------+-----------+--------------------+-----+-----+-----+-----+------+
+|  my_layer| features|  my_layer|           |2025-02-25 06:28:...|  0.0|  0.0|  7.0|  6.0| 99999|
++----------+---------+----------+-----------+--------------------+-----+-----+-----+-----+------+
+```
+
+## Reading many GeoPackage files with Sedona and Spark
+
+You can also read many GeoPackage files with Sedona. Suppose you have the following GeoPackage files:
+
+```
+gpkgs/
+  my_file1.gpkg
+  my_file2.gpkg
+```
+
+Here’s how you can read all the files:
+
+```python
+df = (
+    sedona.read.format("geopackage")
+    .option("tableName", "my_layer")
+    .load("/tmp/gpkgs")
+)
+df.show()
+```
+
+Here are the results:
+
+```
++---+--------------------+---------+-----+
+|fid|                geom|     name|value|
++---+--------------------+---------+-----+
+|  1|         POINT (5 5)|  Point C|   30|
+|  2|POLYGON ((5 5, 6 ...|Polygon A|   40|
+|  1|         POINT (0 0)|  Point A|   10|
+|  2|         POINT (1 1)|  Point B|   20|
++---+--------------------+---------+-----+
+```
+
+You just need to supply the directory containing the GeoPackage files, and Sedona can read all of them into a DataFrame.
+
+Sedona is an excellent option for analyzing many GeoPackage files because it can read and process them in parallel.
+
+## Load raster data stored in GeoPackage files
+
+You can also load data from raster tables in the GeoPackage file. To load raster data, you can use the following code.
+
+```python
+df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
+```
+
+Here are the contents of the DataFrame:
+
+```
++---+----------+-----------+--------+--------------------+
+| id|zoom_level|tile_column|tile_row|           tile_data|
++---+----------+-----------+--------+--------------------+
+|  1|        11|        428|     778|GridCoverage2D["c...|
+|  2|        11|        429|     778|GridCoverage2D["c...|
+|  3|        11|        428|     779|GridCoverage2D["c...|
+|  4|        11|        429|     779|GridCoverage2D["c...|
+|  5|        11|        427|     777|GridCoverage2D["c...|
++---+----------+-----------+--------+--------------------+
+```
+
+Known limitations (v1.7.0):
+
+* WebP rasters are not supported
+* EWKB geometries are not supported
+* filtering based on geometry envelopes is not supported
+
+All points above should be resolved soon; stay tuned!
+
+## Advantages of the GeoPackage file format
+
+The GeoPackage file format has many advantages:
+
+* Any engine can support GeoPackage because it’s an open format.
+* It’s mutable, unlike many other formats.
+* It saves CRS information, unlike some other formats.
+* It can store vector and raster data.
+* It can be read by many engines like GeoPandas, Sedona, and SQLite, of course.
+
+However, the GeoPackage format also has many downsides.
+
+## Disadvantages of GeoPackage
+
+The GeoPackage file format has the following disadvantages:
+
+* It’s row-oriented, so it can’t take advantage of column pruning like columnar file formats.
+* It does not support concurrent transactions across multiple engines.
+* SQLite transactions are supported, but building reliable transactions with other engines would be hard.
+* Not all engines fully support it.
+
+## Conclusion
+
+GeoPackage is a solid file format if you’re using SQLite.
+
+It’s excellent that Sedona can read GeoPackage files created by SQLite analyses. This allows you to read GeoPackage files in parallel and analyze massive datasets. You can also run Sedona on a cluster.
+
+If you don’t already use GeoPackage, you should probably use file formats like GeoParquet or Iceberg.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index 828d1dd936..aa894bce5d 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -658,7 +658,7 @@ For Postgis there is no need to add a query to convert geometry types since it's
     .withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
     ```
 
-## Load from geopackage
+## Load from GeoPackage
 
 Since v1.7.0, Sedona supports loading Geopackage file format as a DataFrame.
 
@@ -680,67 +680,7 @@ Since v1.7.0, Sedona supports loading Geopackage file format as a DataFrame.
     df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage")
     ```
 
-Geopackage files can contain vector data and raster data. To show the possible options from a file you can
-look into the metadata table by adding parameter showMetadata and set its value as true.
-
-=== "Scala/Java"
-
-    ```scala
-    val df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
-    ```
-
-=== "Java"
-
-    ```java
-    Dataset<Row> df = sedona.read().format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
-    ```
-
-=== "Python"
-
-    ```python
-    df = sedona.read.format("geopackage").option("showMetadata", "true").load("/path/to/geopackage")
-
-Then you can see the metadata of the geopackage file like below.
-
-```
-+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
-|          table_name|data_type|          identifier|description|         last_change|     min_x|            min_y|     max_x|     max_y|srs_id|
-+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
-|gis_osm_water_a_f...| features|gis_osm_water_a_f...|           |2024-09-30 23:07:...|-9.0257084|57.96814069999999|33.4866675|80.4291867|  4326|
-+--------------------+---------+--------------------+-----------+--------------------+----------+-----------------+----------+----------+------+
-```
-
-You can also load data from raster tables in the geopackage file. To load raster data, you can use the following code.
-
-=== "Scala/Java"
-
-    ```scala
-    val df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
-    ```
-
-=== "Java"
-
-    ```java
-    Dataset<Row> df = sedona.read().format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
-    ```
-
-=== "Python"
-
-    ```python
-    df = sedona.read.format("geopackage").option("tableName", "raster_table").load("/path/to/geopackage")
-    ```
-
-```
-+---+----------+-----------+--------+--------------------+
-| id|zoom_level|tile_column|tile_row|           tile_data|
-+---+----------+-----------+--------+--------------------+
-|  1|        11|        428|     778|GridCoverage2D["c...|
-|  2|        11|        429|     778|GridCoverage2D["c...|
-|  3|        11|        428|     779|GridCoverage2D["c...|
-|  4|        11|        429|     779|GridCoverage2D["c...|
-|  5|        11|        427|     777|GridCoverage2D["c...|
-+---+----------+-----------+--------+--------------------+
-```
+See [this page](../files/geopackage-sedona-spark) for more information on loading GeoPackage.
 
 Known limitations (v1.7.0):
 
diff --git a/mkdocs.yml b/mkdocs.yml
index 2b15dc683a..fca59dde38 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -62,6 +62,7 @@ nav:
           - Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
           - Files:
               - CSV: tutorial/files/csv-geometry-sedona-spark.md
+              - GeoPackage: tutorial/files/geopackage-sedona-spark.md
               - GeoParquet: tutorial/files/geoparquet-sedona-spark.md
               - GeoJSON: tutorial/files/geojson-sedona-spark.md
           - Map visualization SQL app:
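
Note on the metadata output in the new tutorial: the columns printed with `showMetadata` (`table_name`, `data_type`, `identifier`, and so on) match the `gpkg_contents` table defined by the GeoPackage specification, and the tutorial lists SQLite among the engines that can read the format. The sketch below illustrates that idea with Python's standard `sqlite3` module only. It builds a toy `gpkg_contents` table in memory rather than a conformant GeoPackage. The helper names, the `last_change` timestamp, and the assumption that Sedona's metadata view corresponds to this table are illustrative, not taken from the commit.

```python
import sqlite3

# Column names of gpkg_contents, the same ones shown in the metadata output.
CONTENTS_COLUMNS = [
    "table_name", "data_type", "identifier", "description",
    "last_change", "min_x", "min_y", "max_x", "max_y", "srs_id",
]


def make_toy_contents_table(conn):
    """Create a stand-in gpkg_contents table (hypothetical helper).

    This is NOT a conformant GeoPackage: it only mimics the metadata table.
    """
    conn.execute(
        """
        CREATE TABLE gpkg_contents (
            table_name TEXT PRIMARY KEY,
            data_type TEXT NOT NULL,
            identifier TEXT,
            description TEXT DEFAULT '',
            last_change TEXT,
            min_x REAL, min_y REAL, max_x REAL, max_y REAL,
            srs_id INTEGER
        )
        """
    )
    # Values echo the tutorial's example row; the timestamp is a placeholder.
    conn.execute(
        "INSERT INTO gpkg_contents VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        ("my_layer", "features", "my_layer", "",
         "2025-01-01T00:00:00Z", 0.0, 0.0, 7.0, 6.0, 99999),
    )
    conn.commit()


def list_layers(conn):
    """Return one dict per row of gpkg_contents (hypothetical helper)."""
    columns = ", ".join(CONTENTS_COLUMNS)
    query = "SELECT " + columns + " FROM gpkg_contents ORDER BY table_name"
    return [dict(zip(CONTENTS_COLUMNS, row)) for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
make_toy_contents_table(conn)
layers = list_layers(conn)
for layer in layers:
    print(layer["table_name"], layer["data_type"], layer["srs_id"])
```

Against a real file, the same `list_layers` query could be pointed at a `.gpkg` path passed to `sqlite3.connect`, which would be a quick way to discover the layer names to pass as `tableName`.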
