This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new cf606189d6 [DOCS] add shapefiles documentation page (#1837)
cf606189d6 is described below
commit cf606189d67aef4c12b2a5759716c68336bf23ae
Author: Matthew Powers <[email protected]>
AuthorDate: Mon Mar 3 00:49:17 2025 -0500
[DOCS] add shapefiles documentation page (#1837)
---
docs/tutorial/files/shapefiles-sedona-spark.md | 215 +++++++++++++++++++++++++
docs/tutorial/sql.md | 63 +-------
mkdocs.yml | 1 +
3 files changed, 217 insertions(+), 62 deletions(-)
diff --git a/docs/tutorial/files/shapefiles-sedona-spark.md b/docs/tutorial/files/shapefiles-sedona-spark.md
new file mode 100644
index 0000000000..3b24349b68
--- /dev/null
+++ b/docs/tutorial/files/shapefiles-sedona-spark.md
@@ -0,0 +1,215 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Shapefiles with Apache Sedona and Spark
+
+This page explains how to read Shapefiles with Apache Sedona and Spark.
+
+A Shapefile is “an Esri vector data storage format for storing the location,
shape, and attributes of geographic features.” The Shapefile format is
proprietary, but [the spec is
open](https://www.esri.com/content/dam/esrisites/sitecore-archive/Files/Pdfs/library/whitepapers/pdfs/shapefile.pdf).
+
+Shapefiles have many limitations but are still extensively used, so it’s useful that Sedona can read them.
+
+Let’s look at how to read Shapefiles with Sedona and Spark.
+
+## Read Shapefiles with Sedona and Spark
+
+Let’s start by creating a Shapefile with GeoPandas and Shapely:
+
+```python
+import geopandas as gpd
+from shapely.geometry import Point
+
+point1 = Point(0, 0)
+point2 = Point(1, 1)
+
+data = {
+ 'name': ['Point A', 'Point B'],
+ 'value': [10, 20],
+ 'geometry': [point1, point2]
+}
+
+gdf = gpd.GeoDataFrame(data, geometry='geometry')
+gdf.to_file("/tmp/my_geodata.shp")
+```
+
+Here are the files that are output:
+
+```
+/tmp/
+ my_geodata.cpg
+ my_geodata.dbf
+ my_geodata.shp
+ my_geodata.shx
+```
+
+A Shapefile is not a single file: the data is spread across several files. The `.shp` file stores the geometries, `.shx` the shape index, `.dbf` the non-spatial attributes, and `.cpg` the character encoding.
+
+Here’s how to read a Shapefile into a Sedona DataFrame powered by Spark:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/my_geodata.shp")
+df.show()
+```
+
+```
++-----------+-------+-----+
+| geometry| name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A| 10|
+|POINT (1 1)|Point B| 20|
++-----------+-------+-----+
+```
+
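+The loaded `geometry` column works with Sedona’s spatial SQL functions like any other geometry column. For example, here is a small sketch that extracts the point coordinates with `ST_X` and `ST_Y`:
+
+```python
+df.createOrReplaceTempView("points")
+sedona.sql("SELECT name, ST_X(geometry) AS x, ST_Y(geometry) AS y FROM points").show()
+```
+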
+Each record in a Shapefile has a unique record number, which is not loaded by default. You can include it as a column by setting the `key.name` option:
+
+```python
+df = (
+ sedona.read.format("shapefile")
+ .option("key.name", "FID")
+ .load("/tmp/my_geodata.shp")
+)
+df.show()
+```
+
+```
++-----------+---+-------+-----+
+| geometry|FID| name|value|
++-----------+---+-------+-----+
+|POINT (0 0)| 1|Point A| 10|
+|POINT (1 1)| 2|Point B| 20|
++-----------+---+-------+-----+
+```
+
+The name of the geometry column is `geometry` by default. You can change it with the `geometry.name` option. If one of the non-spatial attributes is named "geometry", `geometry.name` must be configured to avoid a conflict.
+
+```python
+df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
+```
+
+The character encoding of string attributes is inferred from the `.cpg` file. If you see garbled values in string fields, you can manually specify the correct charset with the `charset` option. For example:
+
+=== "Scala/Java"
+
+ ```scala
+ val df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+=== "Java"
+
+ ```java
+ Dataset<Row> df = sedona.read().format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+=== "Python"
+
+ ```python
+ df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+Let’s see how to load many Shapefiles into a Sedona DataFrame.
+
+## Load many Shapefiles with Sedona
+
+Suppose you have a directory with many Shapefiles as follows:
+
+```
+/tmp/shapefiles/
+ file1.cpg
+ file1.dbf
+ file1.shp
+ file1.shx
+ file2.cpg
+ file2.dbf
+ file2.shp
+ file2.shx
+```
+
+The directory contains two `.shp` files and other supporting files.
+
+Here’s how to load many Shapefiles into a Sedona DataFrame:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/shapefiles")
+df.show()
+```
+
+```
++-----------+-------+-----+
+| geometry| name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A| 10|
+|POINT (1 1)|Point B| 20|
+|POINT (2 2)|Point C| 10|
+|POINT (3 3)|Point D| 20|
++-----------+-------+-----+
+```
+
+You can just pass the directory where the Shapefiles are stored, and the
Sedona reader will pick them up.
+
+The input path can be a directory containing one or multiple Shapefiles or a
path to a `.shp` file.
+
+* When the input path is a directory, all Shapefiles directly under the directory are loaded. If you want to load Shapefiles in subdirectories as well, specify `.option("recursiveFileLookup", "true")`, as shown below.
+* When the input path is a `.shp` file, that Shapefile is loaded. Sedona will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name and load them automatically.
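+
+For example, here is a minimal sketch of a recursive load. It assumes a hypothetical layout where `/tmp/shapefiles` also contains Shapefiles in nested subdirectories (e.g. `/tmp/shapefiles/2024/file3.shp`):
+
+```python
+df = (
+    sedona.read.format("shapefile")
+    # also pick up Shapefiles that live in subdirectories
+    .option("recursiveFileLookup", "true")
+    .load("/tmp/shapefiles")
+)
+df.show()
+```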
+
+## Advantages of Shapefiles
+
+Shapefiles are deeply integrated into the Esri ecosystem and extensively used
in many services.
+
+You can output a Shapefile from Esri and then read it with another engine like
Sedona.
+
+However, Esri created the Shapefile format in the early 1990s, so it has many
limitations.
+
+## Limitations of Shapefiles
+
+Here are some of the disadvantages of Shapefiles:
+
+* No support for complex geometry types (e.g. curves)
+* No support for NULL values
+* Numeric values can be rounded because field precision is limited
+* Poor Unicode support
+* Field names are limited to 10 characters
+* 2 GB file size limit
+* Spatial indexing is slow compared to alternatives
+* No datetime support (only dates can be stored)
+
+See this page for more information on [the limitations of
Shapefiles](http://switchfromshapefile.org/).
+
+Due to these limitations, other options are worth investigating.
+
+## Shapefile alternatives
+
+There are a variety of other file formats that are good for geometric data:
+
+* Iceberg
+* [GeoParquet](../geoparquet-sedona-spark)
+* FlatGeoBuf
+* [GeoPackage](../geopackage-sedona-spark)
+* [GeoJSON](../geojson-sedona-spark)
+* [CSV](../csv-geometry-sedona-spark)
+* GeoTIFF
+
+## Why Sedona does not support Shapefile writes
+
+Sedona does not write Shapefiles for two main reasons:
+
+1. Each Shapefile is a collection of files, which is hard for distributed
systems to write.
+2. A Shapefile has a hard 2 GB size limit, which isn’t large enough for some
spatial data.
+
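+If you need to persist data that you loaded from a Shapefile, one option is to write it back out in one of the alternative formats listed above, such as GeoParquet. Here is a minimal sketch (the output path is just an example):
+
+```python
+# write the DataFrame out as GeoParquet instead of a Shapefile
+df.write.format("geoparquet").save("/tmp/my_geodata_geoparquet")
+```
+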
+## Conclusion
+
+Shapefiles are a legacy file format still used in many production
applications. However, they have many limitations and aren’t the best option in
a modern data pipeline unless you need compatibility with legacy systems.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index bde62529c5..4ea1ff0754 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -502,68 +502,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a DataFrame.
The input path can be a directory containing one or multiple shapefiles, or
path to a `.shp` file.
-- When the input path is a directory, all shapefiles directly under the
directory will be loaded. If you want to load all shapefiles in subdirectories,
please specify `.option("recursiveFileLookup", "true")`.
-- When the input path is a `.shp` file, that shapefile will be loaded. Sedona
will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name
and load them automatically.
-
-The name of the geometry column is `geometry` by default. You can change the
name of the geometry column using the `geometry.name` option. If one of the
non-spatial attributes is named "geometry", `geometry.name` must be configured
to avoid conflict.
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df =
sedona.read().format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-Each record in shapefile has a unique record number, that record number is not
loaded by default. If you want to include record number in the loaded
DataFrame, you can set the `key.name` option to the name of the record number
column:
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df = sedona.read().format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-The character encoding of string attributes are inferred from the `.cpg` file.
If you see garbled values in string fields, you can manually specify the
correct charset using the `charset` option. For example:
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df = sedona.read().format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
+See [this page](../files/shapefiles-sedona-spark) for more information on loading Shapefiles.
## Load GeoParquet
diff --git a/mkdocs.yml b/mkdocs.yml
index 01b838ce4a..eb478fc515 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -65,6 +65,7 @@ nav:
- GeoPackage: tutorial/files/geopackage-sedona-spark.md
- GeoParquet: tutorial/files/geoparquet-sedona-spark.md
- GeoJSON: tutorial/files/geojson-sedona-spark.md
+ - Shapefiles: tutorial/files/shapefiles-sedona-spark.md
- Map visualization SQL app:
- Scala/Java: tutorial/viz.md
- Use Apache Zeppelin: tutorial/zeppelin.md