This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new cf606189d6 [DOCS] add shapefiles documentation page (#1837)
cf606189d6 is described below
commit cf606189d67aef4c12b2a5759716c68336bf23ae
Author: Matthew Powers <[email protected]>
AuthorDate: Mon Mar 3 00:49:17 2025 -0500
[DOCS] add shapefiles documentation page (#1837)
---
docs/tutorial/files/shapefiles-sedona-spark.md | 215 +++++++++++++++++++++++++
docs/tutorial/sql.md | 63 +-------
mkdocs.yml | 1 +
3 files changed, 217 insertions(+), 62 deletions(-)
diff --git a/docs/tutorial/files/shapefiles-sedona-spark.md b/docs/tutorial/files/shapefiles-sedona-spark.md
new file mode 100644
index 0000000000..3b24349b68
--- /dev/null
+++ b/docs/tutorial/files/shapefiles-sedona-spark.md
@@ -0,0 +1,215 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Shapefiles with Apache Sedona and Spark
+
+This page explains how to read Shapefiles with Apache Sedona and Spark.
+
+A Shapefile is “an Esri vector data storage format for storing the location,
shape, and attributes of geographic features.” The Shapefile format is
proprietary, but [the spec is
open](https://www.esri.com/content/dam/esrisites/sitecore-archive/Files/Pdfs/library/whitepapers/pdfs/shapefile.pdf).
+
+Shapefiles have many limitations but are still extensively used, so it’s useful that Sedona can read them.
+
+Let’s look at how to read Shapefiles with Sedona and Spark.
+
+## Read Shapefiles with Sedona and Spark
+
+Let’s start by creating a Shapefile with GeoPandas and Shapely:
+
+```python
+import geopandas as gpd
+from shapely.geometry import Point
+
+point1 = Point(0, 0)
+point2 = Point(1, 1)
+
+data = {
+ 'name': ['Point A', 'Point B'],
+ 'value': [10, 20],
+ 'geometry': [point1, point2]
+}
+
+gdf = gpd.GeoDataFrame(data, geometry='geometry')
+gdf.to_file("/tmp/my_geodata.shp")
+```
+
+Here are the files that are output:
+
+```
+/tmp/
+ my_geodata.cpg
+ my_geodata.dbf
+ my_geodata.shp
+ my_geodata.shx
+```
+
+A Shapefile is not a single file: the data is spread across several files. The `.shp` file stores the geometries, `.shx` the shape index, `.dbf` the non-spatial attributes, and `.cpg` the character encoding.
+
+Here’s how to read a Shapefile into a Sedona DataFrame powered by Spark:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/my_geodata.shp")
+df.show()
+```
+
+```
++-----------+-------+-----+
+| geometry| name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A| 10|
+|POINT (1 1)|Point B| 20|
++-----------+-------+-----+
+```
+
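+The loaded `geometry` column works with Sedona’s spatial SQL functions like any other geometry column. For example, here is a small sketch that extracts the point coordinates with `ST_X` and `ST_Y`:
+
+```python
+df.createOrReplaceTempView("points")
+sedona.sql("SELECT name, ST_X(geometry) AS x, ST_Y(geometry) AS y FROM points").show()
+```
+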
+Each record in a Shapefile has a unique record number, which is not loaded by default. You can include it as a column by setting the `key.name` option:
+
+```python
+df = (
+ sedona.read.format("shapefile")
+ .option("key.name", "FID")
+ .load("/tmp/my_geodata.shp")
+)
+df.show()
+```
+
+```
++-----------+---+-------+-----+
+| geometry|FID| name|value|
++-----------+---+-------+-----+
+|POINT (0 0)| 1|Point A| 10|
+|POINT (1 1)| 2|Point B| 20|
++-----------+---+-------+-----+
+```
+
+The name of the geometry column is `geometry` by default. You can change it with the `geometry.name` option. If one of the non-spatial attributes is named "geometry", `geometry.name` must be configured to avoid a conflict.
+
+```python
+df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
+```
+
+The character encoding of string attributes is inferred from the `.cpg` file. If you see garbled values in string fields, you can manually specify the correct charset with the `charset` option. For example:
+
+=== "Scala/Java"
+
+ ```scala
+ val df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+=== "Java"
+
+ ```java
+ Dataset<Row> df = sedona.read().format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+=== "Python"
+
+ ```python
+ df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
+ ```
+
+Let’s see how to load many Shapefiles into a Sedona DataFrame.
+
+## Load many Shapefiles with Sedona
+
+Suppose you have a directory with many Shapefiles as follows:
+
+```
+/tmp/shapefiles/
+ file1.cpg
+ file1.dbf
+ file1.shp
+ file1.shx
+ file2.cpg
+ file2.dbf
+ file2.shp
+ file2.shx
+```
+
+The directory contains two `.shp` files and other supporting files.
+
+Here’s how to load many Shapefiles into a Sedona DataFrame:
+
+```python
+df = sedona.read.format("shapefile").load("/tmp/shapefiles")
+df.show()
+```
+
+```
++-----------+-------+-----+
+| geometry| name|value|
++-----------+-------+-----+
+|POINT (0 0)|Point A| 10|
+|POINT (1 1)|Point B| 20|
+|POINT (2 2)|Point C| 10|
+|POINT (3 3)|Point D| 20|
++-----------+-------+-----+
+```
+
+You can just pass the directory where the Shapefiles are stored, and the
Sedona reader will pick them up.
+
+The input path can be a directory containing one or multiple Shapefiles or a
path to a `.shp` file.
+
+* When the input path is a directory, all Shapefiles directly under the directory are loaded. If you want to load Shapefiles in subdirectories as well, specify `.option("recursiveFileLookup", "true")`, as shown below.
+* When the input path is a `.shp` file, that Shapefile is loaded. Sedona will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name and load them automatically.
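+
+For example, here is a minimal sketch of a recursive load. It assumes a hypothetical layout where `/tmp/shapefiles` also contains Shapefiles in nested subdirectories (e.g. `/tmp/shapefiles/2024/file3.shp`):
+
+```python
+df = (
+    sedona.read.format("shapefile")
+    # also pick up Shapefiles that live in subdirectories
+    .option("recursiveFileLookup", "true")
+    .load("/tmp/shapefiles")
+)
+df.show()
+```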
+
+## Advantages of Shapefiles
+
+Shapefiles are deeply integrated into the Esri ecosystem and extensively used
in many services.
+
+You can output a Shapefile from Esri and then read it with another engine like
Sedona.
+
+However, Esri created the Shapefile format in the early 1990s, so it has many
limitations.
+
+## Limitations of Shapefiles
+
+Here are some of the disadvantages of Shapefiles:
+
+* No support for complex geometry types (e.g. curves)
+* No support for NULL values
+* Numeric values can be rounded because field precision is limited
+* Poor Unicode support
+* Field names are limited to 10 characters
+* 2 GB file size limit
+* Spatial indexing is slow compared to alternatives
+* No datetime support (only dates can be stored)
+
+See this page for more information on [the limitations of
Shapefiles](http://switchfromshapefile.org/).
+
+Due to these limitations, other options are worth investigating.
+
+## Shapefile alternatives
+
+There are a variety of other file formats that are good for geometric data:
+
+* Iceberg
+* [GeoParquet](../geoparquet-sedona-spark)
+* FlatGeoBuf
+* [GeoPackage](../geopackage-sedona-spark)
+* [GeoJSON](../geojson-sedona-spark)
+* [CSV](../csv-geometry-sedona-spark)
+* GeoTIFF
+
+## Why Sedona does not support Shapefile writes
+
+Sedona does not write Shapefiles for two main reasons:
+
+1. Each Shapefile is a collection of files, which is hard for distributed
systems to write.
+2. A Shapefile has a hard 2 GB size limit, which isn’t large enough for some
spatial data.
+
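+If you need to persist data that you loaded from a Shapefile, one option is to write it back out in one of the alternative formats listed above, such as GeoParquet. Here is a minimal sketch (the output path is just an example):
+
+```python
+# write the DataFrame out as GeoParquet instead of a Shapefile
+df.write.format("geoparquet").save("/tmp/my_geodata_geoparquet")
+```
+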
+## Conclusion
+
+Shapefiles are a legacy file format still used in many production
applications. However, they have many limitations and aren’t the best option in
a modern data pipeline unless you need compatibility with legacy systems.
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index bde62529c5..4ea1ff0754 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -502,68 +502,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a DataFrame.
The input path can be a directory containing one or multiple shapefiles, or
path to a `.shp` file.
-- When the input path is a directory, all shapefiles directly under the
directory will be loaded. If you want to load all shapefiles in subdirectories,
please specify `.option("recursiveFileLookup", "true")`.
-- When the input path is a `.shp` file, that shapefile will be loaded. Sedona
will look for sibling files (`.dbf`, `.shx`, etc.) with the same main file name
and load them automatically.
-
-The name of the geometry column is `geometry` by default. You can change the
name of the geometry column using the `geometry.name` option. If one of the
non-spatial attributes is named "geometry", `geometry.name` must be configured
to avoid conflict.
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df =
sedona.read().format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("geometry.name",
"geom").load("/path/to/shapefile")
- ```
-
-Each record in shapefile has a unique record number, that record number is not
loaded by default. If you want to include record number in the loaded
DataFrame, you can set the `key.name` option to the name of the record number
column:
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df = sedona.read().format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("key.name",
"FID").load("/path/to/shapefile")
- ```
-
-The character encoding of string attributes are inferred from the `.cpg` file.
If you see garbled values in string fields, you can manually specify the
correct charset using the `charset` option. For example:
-
-=== "Scala/Java"
-
- ```scala
- val df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df = sedona.read().format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
-
-=== "Python"
-
- ```python
- df = sedona.read.format("shapefile").option("charset",
"UTF-8").load("/path/to/shapefile")
- ```
+See [this page](../files/shapefiles-sedona-spark) for more information on loading Shapefiles.
## Load GeoParquet
diff --git a/mkdocs.yml b/mkdocs.yml
index 01b838ce4a..eb478fc515 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -65,6 +65,7 @@ nav:
- GeoPackage: tutorial/files/geopackage-sedona-spark.md
- GeoParquet: tutorial/files/geoparquet-sedona-spark.md
- GeoJSON: tutorial/files/geojson-sedona-spark.md
+ - Shapefiles: tutorial/files/shapefiles-sedona-spark.md
- Map visualization SQL app:
- Scala/Java: tutorial/viz.md
- Use Apache Zeppelin: tutorial/zeppelin.md