This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch release-1.4.0
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/release-1.4.0 by this push:
new 5b6074cf Fix a number of tutorials
5b6074cf is described below
commit 5b6074cf906c78bb5214ba535631f0bea7ef5412
Author: Jia Yu <[email protected]>
AuthorDate: Sun Mar 19 16:59:09 2023 -0700
Fix a number of tutorials
---
docs/api/sql/Optimizer.md | 29 +++++++++---
docs/setup/release-notes.md | 6 ++-
docs/tutorial/flink/sql.md | 90 +++++++++++++++++++++++++++++++++++---
docs/tutorial/geopandas-shapely.md | 3 ++
docs/tutorial/rdd.md | 2 +
5 files changed, 117 insertions(+), 13 deletions(-)
diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index 2019d51a..025b964b 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -43,6 +43,7 @@ RangeJoin polygonshape#20: geometry, pointshape#43: geometry,
false
All join queries in SedonaSQL are inner joins
## Distance join
+
Introduction: Find geometries from A and geometries from B such that the
internal Euclidean distance of each geometry pair is less than or equal to a
certain distance
Spark SQL Example:
@@ -72,7 +73,7 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33:
geometry, 2.0, true
```
!!!warning
- Sedona doesn't control the distance's unit (degree or meter). It is
same with the geometry. To change the geometry's unit, please transform the
coordinate reference system. See [ST_Transform](Function.md#st_transform).
+ Sedona doesn't control the distance's unit (degree or meter). It is
the same as the geometry's unit. If your coordinates are in the longitude and latitude
system, the unit of `distance` should be degree instead of meter or mile. To
change the geometry's unit, please transform the coordinate reference
system to a meter-based system. See [ST_Transform](Function.md#st_transform).
If you don't want to transform your data and are ok with sacrificing the query
accuracy, you can use an approxima [...]
## Broadcast index join
@@ -127,9 +128,9 @@ Note: If the distance is an expression, it is only
evaluated on the first argume
When one table involved in a spatial join query is smaller than a threshold,
Sedona will automatically choose broadcast index join instead of Sedona
optimized join. The current threshold is controlled by
[sedona.join.autoBroadcastJoinThreshold](../Parameter) and set to the same as
`spark.sql.autoBroadcastJoinThreshold`.
-## Google S2 based equi-join
+## Google S2 based approximate equi-join
-If the performance of Sedona optimized join is not ideal, which is possibly
caused by complicated and overlapping geometries, you can resort to Sedona
built-in Google S2-based equi-join. This equi-join leverages Spark's internal
equi-join algorithm and might be performant in some cases given that the
refinement step is optional.
+If the performance of Sedona optimized join is not ideal, which is possibly
caused by complicated and overlapping geometries, you can resort to Sedona
built-in Google S2-based approximate equi-join. This equi-join leverages
Spark's internal equi-join algorithm and might be performant given that you can
opt to skip the refinement step by sacrificing query accuracy.
Please use the following steps:
@@ -161,14 +162,16 @@ FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
Due to the nature of S2 cell IDs, the equi-join results might have a few
false positives depending on the S2 level you choose. A smaller level indicates
bigger cells, fewer exploded rows, but more false positives.
-To ensure the correctness, you can use [Spatial Predicate](../Predicate/) to
filter out them.
+To ensure correctness, you can use one of the [Spatial
Predicates](../Predicate/) to filter them out. Use this query instead of the
query in Step 2.
```sql
-SELECT *
-FROM joinresult
-WHERE ST_Contains(lcs.geom, rcs.geom)
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs, rcs
+WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)
```
+As you can see, compared to the query in Step 2, we added one more filter,
`ST_Contains`, to remove false positives. You can also use other predicates
such as `ST_Intersects`.
+
!!!tip
You can skip this step if you don't need 100% accuracy and want faster
query speed.
@@ -195,6 +198,18 @@ GROUP BY (lcs_geom, rcs_geom)
!!!note
If you are doing point-in-polygon join, this is not a problem and you
can safely discard this issue. This issue only happens when you do
polygon-polygon, polygon-linestring, linestring-linestring join.
+### S2 for distance join
+
+This also works for distance join. You first need to use `ST_Buffer(geometry,
distance)` to wrap one of your original geometry columns. If your original
geometry column contains points, this `ST_Buffer` will turn them into circles
with a radius of `distance`.
+
+For example, run this query first on the left table before Step 1.
+
+```sql
+SELECT id, ST_Buffer(geom, DISTANCE), name
+FROM lefts
+```
+
+Since the coordinates are in the longitude and latitude system, the unit of
`distance` should be degree instead of meter or mile. You will have to estimate
the corresponding degrees based on your meter values. Please use [this
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
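
The degree estimate described above can also be computed by hand. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6,371,008.8 m; the function names are hypothetical and are not part of Sedona. Note that one degree of longitude covers less ground as latitude increases, so a single degree value is only an approximation.

```python
import math

# Assumption: spherical Earth with the mean radius in meters.
EARTH_RADIUS_M = 6_371_008.8


def meters_to_degrees_lat(meters: float) -> float:
    """Approximate a north-south distance in meters as degrees of latitude."""
    return math.degrees(meters / EARTH_RADIUS_M)


def meters_to_degrees_lon(meters: float, latitude_deg: float) -> float:
    """Approximate an east-west distance in meters as degrees of longitude.

    The circle of latitude shrinks by cos(latitude), so the same distance
    spans more degrees of longitude away from the equator.
    """
    radius_at_latitude = EARTH_RADIUS_M * math.cos(math.radians(latitude_deg))
    return math.degrees(meters / radius_at_latitude)


if __name__ == "__main__":
    # Hypothetical usage: a 1 km search distance near latitude 40.
    print(meters_to_degrees_lat(1000.0))
    print(meters_to_degrees_lon(1000.0, 40.0))
```

Under these assumptions, 1 km is roughly 0.009 degrees of latitude. Using the larger of the two values as `distance` keeps the buffer from being too small, at the cost of more candidate pairs.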
## Regular spatial predicate pushdown
Introduction: Given a join query and a predicate in the same WHERE clause,
first executes the Predicate as a filter, then executes the join query.
diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md
index 99179c5a..8e6eb3f0 100644
--- a/docs/setup/release-notes.md
+++ b/docs/setup/release-notes.md
@@ -1,14 +1,18 @@
!!!warning
Support of Spark 2.X and Scala 2.11 was removed in Sedona 1.3.0+
although some parts of the source code might still be compatible. Sedona 1.3.0+
releases binary for both Scala 2.12 and 2.13.
+!!!danger
+ Sedona Python currently only works with Shapely 1.x. If you use
GeoPandas, please use GeoPandas <= `0.11.1`. GeoPandas > 0.11.1 will
automatically install Shapely 2.0. If you use Shapely, please use Shapely <= `1.8.4`.
+
## Sedona 1.4.0
Sedona 1.4.0 is compiled against Spark 3.3 / Flink 1.12, Java 8.
### Highlights
-* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce
memory consumption by 10X: see
[explanation](../../api/sql/Optimizer/#geoparquet)
* [X] **Sedona Spark & Flink** Serialize and deserialize geometries 3 - 7X
faster
+* [X] **Sedona Spark & Flink** Google S2 based spatial join for fast
approximate point-in-polygon join. See [Join query in
Spark](../../api/sql/Optimizer/#google-s2-based-approximate-equi-join) and
[Join query in Flink](../../tutorial/flink/sql/#join-query)
+* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce
memory consumption by 10X: see
[explanation](../../api/sql/Optimizer/#geoparquet)
* [X] **Sedona Spark** Automatically use broadcast index spatial join for
small datasets
* [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader.
* [X] **Sedona Spark** A number of bug fixes and improvements to the Sedona R
module.
diff --git a/docs/tutorial/flink/sql.md b/docs/tutorial/flink/sql.md
index facdfaec..a80835f8 100644
--- a/docs/tutorial/flink/sql.md
+++ b/docs/tutorial/flink/sql.md
@@ -166,12 +166,9 @@ After the transformation:
+----+--------------------------------+--------------------------------+
```
-
-## Run spatial queries
-
After creating a Geometry type column, you are able to run spatial queries.
-### Range query
+## Range query
Use ==ST_Contains==, ==ST_Intersects== and so on to run a range query over a
single column.
@@ -190,7 +187,7 @@ geomTable.execute().print()
!!!note
Read [SedonaSQL Predicate API](../../../api/flink/Predicate) to learn
different spatial query predicates.
-### KNN query
+## KNN query
Use ==ST_Distance== to calculate the distance and rank the distance.
@@ -207,6 +204,89 @@ geomTable = tableEnv.sqlQuery(
geomTable.execute().print()
```
+## Join query
+
+This Google S2-based approximate equi-join leverages Flink's internal equi-join algorithm. You can opt to
skip the Sedona refinement step by sacrificing query accuracy.
+
+Please use the following steps:
+
+### 1. Generate S2 ids for both tables
+
+Use [ST_S2CellIds](../../../api/flink/Function/#st_s2cellids) to generate cell
IDs. Each geometry may produce one or more IDs.
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM lefts
+```
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM rights
+```
+
+### 2. Perform equi-join
+
+Join the two tables by their S2 cellId
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
+```
+
+
+### 3. Optional: Refine the result
+
+Due to the nature of S2 cell IDs, the equi-join results might have a few
false positives depending on the S2 level you choose. A smaller level indicates
bigger cells, fewer exploded rows, but more false positives.
+
+To ensure correctness, you can use one of the [Spatial
Predicates](../../../api/Predicate/) to filter them out. Use this query instead
of the query in Step 2.
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs, rcs
+WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)
+```
+
+As you can see, compared to the query in Step 2, we added one more filter,
`ST_Contains`, to remove false positives. You can also use other predicates
such as `ST_Intersects`.
+
+!!!tip
+ You can skip this step if you don't need 100% accuracy and want faster
query speed.
+
+### 4. Optional: De-duplicate
+
+Due to the explode function used when we generate S2 Cell Ids, the resulting
DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can
remove them by performing a GroupBy query.
+
+```sql
+SELECT lcs_id, rcs_id, FIRST_VALUE(lcs_geom), FIRST_VALUE(lcs_name),
FIRST_VALUE(rcs_geom), FIRST_VALUE(rcs_name)
+FROM joinresult
+GROUP BY (lcs_id, rcs_id)
+```
+
+The `FIRST_VALUE` function takes the first value from a number of
duplicate values.
+
+If you don't have a unique id for each geometry, you can also group by
geometry itself. See below:
+
+```sql
+SELECT lcs_geom, rcs_geom, FIRST_VALUE(lcs_name), FIRST_VALUE(rcs_name)
+FROM joinresult
+GROUP BY (lcs_geom, rcs_geom)
+```
+
+!!!note
+ If you are doing point-in-polygon join, this is not a problem and you
can safely discard this issue. This issue only happens when you do
polygon-polygon, polygon-linestring, linestring-linestring join.
+
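
The four steps above can be pictured with a small pure-Python sketch. It is not Sedona or S2 code: a uniform square grid stands in for S2 cells, axis-aligned bounding boxes `(minx, miny, maxx, maxy)` stand in for geometries, and the names `cell_ids`, `contains`, and `cell_join` are hypothetical. It only illustrates why a cell-based equi-join produces duplicates and false positives, and how refinement and de-duplication remove them.

```python
from collections import defaultdict
from math import floor


def cell_ids(bbox, size):
    """Step 1: list every grid cell touched by the box (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    return [
        (cx, cy)
        for cx in range(floor(minx / size), floor(maxx / size) + 1)
        for cy in range(floor(miny / size), floor(maxy / size) + 1)
    ]


def contains(outer, inner):
    """Exact predicate used for refinement: does `outer` fully contain `inner`?"""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def cell_join(lefts, rights, size, refine=True):
    """Join two {id: bbox} dicts on shared grid cells."""
    # Explode the right side into (cell -> ids), like exploding cell IDs.
    index = defaultdict(list)
    for rid, rbox in rights.items():
        for cell in cell_ids(rbox, size):
            index[cell].append(rid)

    # Step 2: equi-join on the cell id.
    # Step 4: the set drops pairs that match in more than one cell.
    pairs = set()
    for lid, lbox in lefts.items():
        for cell in cell_ids(lbox, size):
            for rid in index.get(cell, []):
                # Step 3 (optional): refine with the exact predicate.
                if not refine or contains(lbox, rights[rid]):
                    pairs.add((lid, rid))
    return sorted(pairs)


if __name__ == "__main__":
    lefts = {"L1": (0.0, 0.0, 2.0, 2.0)}
    rights = {
        "R1": (0.5, 0.5, 1.5, 1.5),  # inside L1, shares four cells with it
        "R2": (2.2, 2.2, 2.8, 2.8),  # outside L1, but shares one cell
    }
    print(cell_join(lefts, rights, size=1.0, refine=False))
    print(cell_join(lefts, rights, size=1.0, refine=True))
```

In this sketch, skipping refinement keeps the pair with `R2` even though `L1` does not contain it, which corresponds to the false positives mentioned in Step 3. A larger `size` produces fewer exploded cells per box but more such false matches.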
+### S2 for distance join
+
+This also works for distance join. You first need to use `ST_Buffer(geometry,
distance)` to wrap one of your original geometry columns. If your original
geometry column contains points, this `ST_Buffer` will turn them into circles
with a radius of `distance`.
+
+For example, run this query first on the left table before Step 1.
+
+```sql
+SELECT id, ST_Buffer(geom, DISTANCE), name
+FROM lefts
+```
+
+Since the coordinates are in the longitude and latitude system, the unit of
`distance` should be degree instead of meter or mile. You will have to estimate
the corresponding degrees based on your meter values. Please use [this
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
+
## Convert Spatial Table to Spatial DataStream
### Get DataStream
diff --git a/docs/tutorial/geopandas-shapely.md
b/docs/tutorial/geopandas-shapely.md
index 4b1251fa..96f9774b 100644
--- a/docs/tutorial/geopandas-shapely.md
+++ b/docs/tutorial/geopandas-shapely.md
@@ -1,5 +1,8 @@
# Work with GeoPandas and Shapely
+!!!danger
+ Sedona Python currently only works with Shapely 1.x. If you use
GeoPandas, please use GeoPandas <= `0.11.1`. GeoPandas > 0.11.1 will
automatically install Shapely 2.0. If you use Shapely, please use Shapely <= `1.8.4`.
+
## Interoperate with GeoPandas
Sedona Python has implemented serializers and deserializers which allow
converting Sedona Geometry objects into Shapely BaseGeometry objects. Based on
that, it is possible to load data with GeoPandas from a file (see the drivers
Fiona supports) and create a Spark DataFrame from the GeoDataFrame object.
diff --git a/docs/tutorial/rdd.md b/docs/tutorial/rdd.md
index 99880f38..077415dc 100644
--- a/docs/tutorial/rdd.md
+++ b/docs/tutorial/rdd.md
@@ -928,6 +928,8 @@ The index should be built on either one of two SpatialRDDs.
In general, you shou
A distance join query takes as input two Spatial RDDs A and B and a distance.
For each geometry in A, it finds the geometries (from B) that are within the given
distance to it. A and B can be any geometry type and do not need to have
the same geometry type. The unit of the distance is explained
[here](#transform-the-coordinate-reference-system).
+If you don't want to transform your data and are ok with sacrificing the query
accuracy, you can use an approximate degree value for distance. Please use
[this
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
+
Assume you now have two SpatialRDDs (typed or generic). You can use the
following code to issue a Distance Join Query on them.
=== "Scala"