This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new 6850f43f77 [DOCS] Add geojson docs (#1814)
6850f43f77 is described below
commit 6850f43f77f4fb8954b25a5a8c2f3c6ee89b8ac1
Author: Matthew Powers <[email protected]>
AuthorDate: Sun Feb 16 17:40:11 2025 -0500
[DOCS] Add geojson docs (#1814)
* use dashes not underscores
* fix whitespace
* update based on pr comments
---
docs/tutorial/files/geojson-sedona-spark.md | 237 ++++++++++++++++++++++++++++
mkdocs.yml | 2 +
2 files changed, 239 insertions(+)
diff --git a/docs/tutorial/files/geojson-sedona-spark.md b/docs/tutorial/files/geojson-sedona-spark.md
new file mode 100644
index 0000000000..6081981b8f
--- /dev/null
+++ b/docs/tutorial/files/geojson-sedona-spark.md
@@ -0,0 +1,237 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+ -->
+
+# Apache Sedona GeoJSON with Spark
+
+This page shows how to read/write single-line GeoJSON files and multiline
GeoJSON files with Apache Sedona and Spark.
+
+This page concludes with a summary of the benefits and drawbacks of the GeoJSON file format for spatial analyses.
+
+GeoJSON is based on JSON and supports the following geometry types:
+
+* Point
+* LineString
+* Polygon
+* MultiPoint
+* MultiLineString
+* MultiPolygon
+* GeometryCollection
+
+See here for [more details about the GeoJSON format
specification](https://datatracker.ietf.org/doc/html/rfc7946).
+
+## Read multiline GeoJSON files with Sedona and Spark
+
+Here’s how to read a multiline GeoJSON file with Sedona:
+
+```python
+from pyspark.sql.functions import expr
+
+# Assumes `sedona` is an existing Sedona-enabled SparkSession.
+df = (
+    sedona.read.format("geojson")
+    .option("multiLine", "true")
+    .load("data/multiline_geojson.json")
+    .selectExpr("explode(features) as features")
+    .select("features.*")
+    .withColumn("prop0", expr("properties['prop0']"))
+    .drop("properties")
+    .drop("type")
+)
+df.show(truncate=False)
+```
+
+Here’s the output:
+
+```
++---------------------------------------------+------+
+|geometry                                     |prop0 |
++---------------------------------------------+------+
+|POINT (102 0.5)                              |value0|
+|LINESTRING (102 0, 103 1, 104 0, 105 1)      |value1|
+|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
++---------------------------------------------+------+
+```
+
+The multiline GeoJSON file contains a point, a linestring, and a polygon.
Let’s inspect the content of the file:
+
+```json
+{ "type": "FeatureCollection",
+  "features": [
+    { "type": "Feature",
+      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
+      "properties": {"prop0": "value0"}
+    },
+    { "type": "Feature",
+      "geometry": {
+        "type": "LineString",
+        "coordinates": [
+          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
+        ]
+      },
+      "properties": {
+        "prop0": "value1",
+        "prop1": 0.0
+      }
+    },
+    { "type": "Feature",
+      "geometry": {
+        "type": "Polygon",
+        "coordinates": [
+          [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
+            [100.0, 1.0], [100.0, 0.0] ]
+        ]
+      },
+      "properties": {
+        "prop0": "value2",
+        "prop1": {"this": "that"}
+      }
+    }
+  ]
+}
+```
+
+Notice how the data is modeled as a `FeatureCollection`. Each feature has a
geometry type, geometry coordinates, and properties.
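
To make that structure concrete, here is a minimal sketch that walks a `FeatureCollection` with Python's standard `json` module, outside of Spark. The inline document is a trimmed copy of the file above, and `summarize_features` is a hypothetical helper name used only for illustration.

```python
import json

# Trimmed copy of the multiline file shown above (two of the three features).
geojson_text = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    },
    { "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]]
      },
      "properties": {"prop0": "value1", "prop1": 0.0}
    }
  ]
}
"""


def summarize_features(text):
    """Return (geometry type, coordinates, properties) for each feature."""
    collection = json.loads(text)
    if collection.get("type") != "FeatureCollection":
        raise ValueError("expected a FeatureCollection")
    summary = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        summary.append(
            (geometry["type"], geometry["coordinates"], feature["properties"])
        )
    return summary


for geometry_type, coordinates, properties in summarize_features(geojson_text):
    print(geometry_type, coordinates, properties)
```

The Sedona reader does the equivalent work at scale: the `explode(features)` call in the earlier snippet turns the `features` array into one row per feature.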
+
+You can also read many multiline GeoJSON files. Suppose you have the
following GeoJSON files:
+
+```
+many_geojsons/
+ file1.json
+ file2.json
+```
+
+Here's how you can read many GeoJSON files:
+
+```python
+df = (
+    sedona.read.format("geojson")
+    .option("multiLine", "true")
+    .load("data/many_geojsons")
+)
+```
+
+You just need to pass the directory that contains the JSON files.
+
+Multiline GeoJSON is nicely formatted for humans but inefficient for machines, because the whole file must be parsed as one JSON document. It’s better to store each feature as a single line of JSON.
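
One way to convert a multiline file is to write each feature of the `FeatureCollection` as a compact JSON object on its own line. The sketch below does this with Python's standard library only. `to_single_line_geojson` is a hypothetical helper, not part of Sedona, and it assumes the whole input fits in memory.

```python
import json


def to_single_line_geojson(feature_collection_text):
    """Convert FeatureCollection text into one compact Feature per line."""
    collection = json.loads(feature_collection_text)
    lines = [
        # separators=(",", ":") drops the spaces json.dumps adds by default.
        json.dumps(feature, separators=(",", ":"))
        for feature in collection["features"]
    ]
    return "\n".join(lines) + "\n"


multiline = """
{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "value0"}
    }
  ]
}
"""
print(to_single_line_geojson(multiline), end="")
```

For files too large for one machine, the same conversion can be done by reading with the `multiLine` option and writing the result back out with Sedona.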
+
+## Read single-line GeoJSON files with Sedona and Spark
+
+Here’s how to read single-line GeoJSON files with Sedona:
+
+```python
+from pyspark.sql.functions import expr
+
+df = (
+    sedona.read.format("geojson")
+    .load("data/singleline_geojson.json")
+    .withColumn("prop0", expr("properties['prop0']"))
+    .drop("properties")
+    .drop("type")
+)
+df.show(truncate=False)
+```
+
+Here’s the result:
+
+```
++---------------------------------------------+------+
+|geometry                                     |prop0 |
++---------------------------------------------+------+
+|POINT (102 0.5)                              |value0|
+|LINESTRING (102 0, 103 1, 104 0, 105 1)      |value1|
+|POLYGON ((100 0, 101 0, 101 1, 100 1, 100 0))|value2|
++---------------------------------------------+------+
+```
+
+Here’s the data:
+
+```
+{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
+{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
+{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
+```
+
+Notice how the multiline GeoJSON file wraps all features in a single `FeatureCollection`, whereas each line of the single-line GeoJSON file is a standalone `Feature`.
+
+Single-line GeoJSON files are better for large datasets because query engines can split them on line boundaries and read the pieces in parallel.
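
The sketch below illustrates what "splittable" means, using only the Python standard library. It is an illustration of the idea and not how Spark or Sedona implement input splits. `read_split` is a hypothetical helper that parses only the lines starting inside a given byte range, so two workers can each take half of the file without coordinating.

```python
import json

# The single-line file shown above, as bytes.
data = (
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}\n'
    '{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}\n'
    '{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}\n'
).encode("utf-8")


def read_split(buffer, start, end):
    """Parse the features whose lines begin at a byte offset in [start, end)."""
    position = start
    if start > 0:
        # A line begins right after a newline, so find the first newline
        # at or after start - 1 and begin reading just past it.
        newline = buffer.find(b"\n", start - 1)
        if newline == -1:
            return []
        position = newline + 1
    features = []
    while position < end and position < len(buffer):
        newline = buffer.find(b"\n", position)
        if newline == -1:
            newline = len(buffer)
        line = buffer[position:newline]
        if line.strip():
            features.append(json.loads(line))
        position = newline + 1
    return features


middle = len(data) // 2
first_half = read_split(data, 0, middle)
second_half = read_split(data, middle, len(data))
print(len(first_half), len(second_half))
```

Every feature lands in exactly one split, wherever the cut falls. A multiline `FeatureCollection` cannot be divided this way because a byte range in the middle of the file is not valid JSON on its own.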
+
+Now, let's see how to create GeoJSON files with Sedona by writing out
DataFrames.
+
+## Write to GeoJSON with Sedona and Spark
+
+Let’s create a Sedona DataFrame and then write it out to GeoJSON files:
+
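+```python
+from pyspark.sql.functions import col
+from sedona.sql.st_constructors import ST_GeomFromText
+
+df = sedona.createDataFrame([
+    ("a", 'LINESTRING(2.0 5.0,6.0 1.0)'),
+    ("b", 'LINESTRING(7.0 4.0,9.0 2.0)'),
+    ("c", 'LINESTRING(1.0 3.0,3.0 1.0)'),
+], ["id", "geometry"])
+# Parse the WKT strings into geometry values before writing.
+actual = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
+actual.write.format("geojson").mode("overwrite").save("/tmp/a_thing")
+```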
+
+Here are the files that get written:
+
+```
+a_thing/
+ _SUCCESS
+ part-00000-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
+ part-00003-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
+ part-00007-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
+ part-00011-856044c5-ae35-4306-bf7a-ae9c3cb25434-c000.json
+```
+
+Sedona writes multiple GeoJSON files in parallel, which is faster than writing
a single file.
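
The output directory can also be consumed without Spark. The following sketch reads every part file with the Python standard library. It assumes each part file holds one `Feature` per line with the `id` column stored under `properties`. The file names and contents are made up for the demo, and `read_part_files` is a hypothetical helper.

```python
import glob
import json
import os
import tempfile


def read_part_files(directory):
    """Parse every non-blank line of every part-*.json file in a directory."""
    features = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*.json"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    features.append(json.loads(line))
    return features


# Build a stand-in output directory with made-up part files for the demo.
demo_dir = tempfile.mkdtemp()
sample_parts = {
    "part-00000-demo.json": '{"type":"Feature","geometry":{"type":"LineString","coordinates":[[2.0,5.0],[6.0,1.0]]},"properties":{"id":"a"}}\n',
    "part-00001-demo.json": '{"type":"Feature","geometry":{"type":"LineString","coordinates":[[7.0,4.0],[9.0,2.0]]},"properties":{"id":"b"}}\n',
}
for name, content in sample_parts.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write(content)
# Spark also writes a _SUCCESS marker; the glob pattern above ignores it.
open(os.path.join(demo_dir, "_SUCCESS"), "w").close()

features = read_part_files(demo_dir)
print([feature["properties"]["id"] for feature in features])
```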
+
+Note that the DataFrame must contain a column named `geometry` for the write operation to work.
+
+Now let’s read these GeoJSON files into a DataFrame:
+
+```python
+df = sedona.read.format("geojson").load("/tmp/a_thing")
+df.show(truncate=False)
+```
+
+```
++---------------------+----------+-------+
+|geometry             |properties|type   |
++---------------------+----------+-------+
+|LINESTRING (1 3, 3 1)|{c}       |Feature|
+|LINESTRING (2 5, 6 1)|{a}       |Feature|
+|LINESTRING (7 4, 9 2)|{b}       |Feature|
++---------------------+----------+-------+
+```
+
+## Benefits of the GeoJSON file format
+
+The GeoJSON file format has several advantages:
+
+* It is human-readable.
+* It can be output in multiple files, which allows for faster I/O for parallel processing engines.
+* Many engines support GeoJSON / JSON files.
+
+However, GeoJSON has many downsides, making it a suboptimal choice for storing
geospatial data.
+
+## Limitations of the GeoJSON file format
+
+The GeoJSON format has many limitations that can make it a slow option for
spatial data lakes:
+
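+* RFC 7946 expects coordinates in WGS 84 and dropped the optional `crs` member of the 2008 GeoJSON specification, so files in another CRS may not record it and this critical data can be lost.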
+* It’s a row-oriented file format, so performance optimizations like column
pruning aren’t available (column-oriented file formats, like GeoParquet, can
take advantage of this optimization).
+* It does not store metadata information on row groups, so row-group filtering
isn’t possible (row-group filtering is a Parquet performance optimization).
+* The schema is not stored in the file (Parquet stores it in the file footer), so it needs to be manually written or inferred.
+* The GeoJSON specification requires a specific structure that can be rigid
for certain types of datasets.
+* You can only build GeoJSON data lakes. You can’t use GeoJSON to build data lakehouses because table formats such as Iceberg store their data in file formats like Parquet, not JSON.
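
The schema limitation shows up in the multiline example from earlier on this page, where `prop1` is a number in one feature and an object in another. The sketch below is a rough illustration of schema inference using only the Python standard library. It is not how Spark infers schemas, and `infer_property_types` is a hypothetical helper that records which Python types appear for each property.

```python
import json

# The three features from the multiline example, one per line.
lines = [
    '{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}',
    '{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1","prop1":0.0}}',
    '{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2","prop1":{"this":"that"}}}',
]


def infer_property_types(feature_lines):
    """Map each property name to the set of Python type names seen for it."""
    seen = {}
    for line in feature_lines:
        # "properties" may be null in GeoJSON, so fall back to an empty dict.
        properties = json.loads(line)["properties"] or {}
        for name, value in properties.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return seen


schema = infer_property_types(lines)
for name in sorted(schema):
    print(name, sorted(schema[name]))
```

Every row has to be parsed before the reader knows that `prop1` is missing from one feature and has conflicting types in the other two. A format that stores its schema in the file avoids this full scan.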
+
+## Conclusion
+
+GeoJSON is a common file format in spatial data analyses, and it’s convenient
that Apache Sedona offers full read and write capabilities.
+
+GeoJSON is well-supported and human-readable, but it’s pretty slow compared to
formats like GeoParquet. It’s generally best to use GeoParquet or Iceberg for
spatial data analyses because the performance is much better.
diff --git a/mkdocs.yml b/mkdocs.yml
index 82073f4ab5..959218a208 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -60,6 +60,8 @@ nav:
- Spatial RDD app: tutorial/rdd.md
- Sedona R: api/rdocs
- Work with GeoPandas and Shapely: tutorial/geopandas-shapely.md
+ - Files:
+ - GeoJSON: tutorial/files/geojson-sedona-spark.md
- Map visualization SQL app:
- Scala/Java: tutorial/viz.md
- Use Apache Zeppelin: tutorial/zeppelin.md