[sedona] branch release-1.4.0 updated: Fix a number of tutorials

jiayu Sun, 19 Mar 2023 16:59:23 -0700

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch release-1.4.0
in repository https://gitbox.apache.org/repos/asf/sedona.git



The following commit(s) were added to refs/heads/release-1.4.0 by this push:
     new 5b6074cf Fix a number of tutorials
5b6074cf is described below

commit 5b6074cf906c78bb5214ba535631f0bea7ef5412
Author: Jia Yu <[email protected]>
AuthorDate: Sun Mar 19 16:59:09 2023 -0700

    Fix a number of tutorials
---
 docs/api/sql/Optimizer.md          | 29 +++++++++---
 docs/setup/release-notes.md        |  6 ++-
 docs/tutorial/flink/sql.md         | 90 +++++++++++++++++++++++++++++++++++---
 docs/tutorial/geopandas-shapely.md |  3 ++
 docs/tutorial/rdd.md               |  2 +
 5 files changed, 117 insertions(+), 13 deletions(-)

diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index 2019d51a..025b964b 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -43,6 +43,7 @@ RangeJoin polygonshape#20: geometry, pointshape#43: geometry, 
false
        All join queries in SedonaSQL are inner joins
 
 ## Distance join
+
 Introduction: Find geometries from A and geometries from B such that the 
internal Euclidean distance of each geometry pair is less or equal than a 
certain distance
 
 Spark SQL Example:
@@ -72,7 +73,7 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33: 
geometry, 2.0, true
 ```
 
 !!!warning
-       Sedona doesn't control the distance's unit (degree or meter). It is 
same with the geometry. To change the geometry's unit, please transform the 
coordinate reference system. See [ST_Transform](Function.md#st_transform).
+       Sedona doesn't control the distance's unit (degree or meter). It is 
the same as the geometry's unit. If your coordinates are in the longitude and 
latitude system, the unit of `distance` should be degrees instead of meters or 
miles. To change the geometry's unit, please transform the coordinate reference 
system to a meter-based system. See [ST_Transform](Function.md#st_transform). 
If you don't want to transform your data and are ok with sacrificing the query 
accuracy, you can use an approxima [...]
 
 ## Broadcast index join
 
@@ -127,9 +128,9 @@ Note: If the distance is an expression, it is only 
evaluated on the first argume
 
 When one table involved a spatial join query is smaller than a threadhold, 
Sedona will automatically choose broadcast index join instead of Sedona 
optimized join. The current threshold is controlled by 
[sedona.join.autoBroadcastJoinThreshold](../Parameter) and set to the same as 
`spark.sql.autoBroadcastJoinThreshold`.
 
-## Google S2 based equi-join
+## Google S2 based approximate equi-join
 
-If the performance of Sedona optimized join is not ideal, which is possibly 
caused by  complicated and overlapping geometries, you can resort to Sedona 
built-in Google S2-based equi-join. This equi-join leverages Spark's internal 
equi-join algorithm and might be performant in some cases given that the 
refinement step is optional.
+If the performance of Sedona optimized join is not ideal, which is possibly 
caused by complicated and overlapping geometries, you can resort to Sedona's 
built-in Google S2-based approximate equi-join. This equi-join leverages 
Spark's internal equi-join algorithm and might be performant given that you can 
opt to skip the refinement step by sacrificing query accuracy.
 
 Please use the following steps:
 
@@ -161,14 +162,16 @@ FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
 
 Due to the nature of S2 Cellid, the equi-join results might have a few 
false-positives depending on the S2 level you choose. A smaller level indicates 
bigger cells, less exploded rows, but more false positives.
 
-To ensure the correctness, you can use [Spatial Predicate](../Predicate/) to 
filter out them. 
+To ensure correctness, you can use one of the [Spatial 
Predicates](../Predicate/) to filter them out. Use this query instead of the 
query in Step 2.
 
 ```sql
-SELECT *
-FROM joinresult
-WHERE ST_Contains(lcs.geom, rcs.geom)
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as 
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs, rcs
+WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)
 ```
 
+As you see, compared to the query in Step 2, we added one more filter, which 
is `ST_Contains`, to remove false positives. You can also use `ST_Intersects` 
and so on.
+
 !!!tip
        You can skip this step if you don't need 100% accuracy and want faster 
query speed.
 
@@ -195,6 +198,18 @@ GROUP BY (lcs_geom, rcs_geom)
 !!!note
        If you are doing point-in-polygon join, this is not a problem and you 
can safely discard this issue. This issue only happens when you do 
polygon-polygon, polygon-linestring, linestring-linestring join.
  
+### S2 for distance join
+
+This also works for distance join. You first need to use `ST_Buffer(geometry, 
distance)` to wrap one of your original geometry columns. If your original 
geometry column contains points, this `ST_Buffer` will turn them into circles 
with a radius of `distance`.
+
+For example, run this query first on the left table before Step 1.
+
+```sql
+SELECT id, ST_Buffer(geom, DISTANCE), name
+FROM lefts
+```
+
+Since the coordinates are in the longitude and latitude system, the unit of 
`distance` should be degrees instead of meters or miles. You will have to estimate 
the corresponding degrees based on your meter values. Please use [this 
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
 
 ## Regular spatial predicate pushdown
 Introduction: Given a join query and a predicate in the same WHERE clause, 
first executes the Predicate as a filter, then executes the join query.
diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md
index 99179c5a..8e6eb3f0 100644
--- a/docs/setup/release-notes.md
+++ b/docs/setup/release-notes.md
@@ -1,14 +1,18 @@
 !!!warning
        Support of Spark 2.X and Scala 2.11 was removed in Sedona 1.3.0+ 
although some parts of the source code might still be compatible. Sedona 1.3.0+ 
releases binary for both Scala 2.12 and 2.13.
 
+!!!danger
+       Sedona Python currently only works with Shapely 1.x. If you use 
GeoPandas, please use <= GeoPandas `0.11.1`. GeoPandas > 0.11.1 will 
automatically install Shapely 2.0. If you use Shapely, please use <= `1.8.4`.
+
 ## Sedona 1.4.0
 
 Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12, Java 8.
 
 ### Highlights
 
-* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce 
memory consumption by 10X: see 
[explanation](../../api/sql/Optimizer/#geoparquet)
 * [X] **Sedona Spark & Flink** Serialize and deserialize geometries 3 - 7X 
faster
+* [X] **Sedona Spark & Flink** Google S2 based spatial join for fast 
approximate point-in-polygon join. See [Join query in 
Spark](../../api/sql/Optimizer/#google-s2-based-approximate-equi-join) and 
[Join query in Flink](../../tutorial/flink/sql/#join-query)
+* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce 
memory consumption by 10X: see 
[explanation](../../api/sql/Optimizer/#geoparquet)
 * [X] **Sedona Spark** Automatically use broadcast index spatial join for 
small datasets
 * [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader.
 * [X] **Sedona Spark** A number of bug fixes and improvement to the Sedona R 
module.
diff --git a/docs/tutorial/flink/sql.md b/docs/tutorial/flink/sql.md
index facdfaec..a80835f8 100644
--- a/docs/tutorial/flink/sql.md
+++ b/docs/tutorial/flink/sql.md
@@ -166,12 +166,9 @@ After the transformation:
 +----+--------------------------------+--------------------------------+
 ```
 
-
-## Run spatial queries
-
 After creating a Geometry type column, you are able to run spatial queries.
 
-### Range query
+## Range query
 
 Use ==ST_Contains==, ==ST_Intersects== and so on to run a range query over a 
single column.
 
@@ -190,7 +187,7 @@ geomTable.execute().print()
 !!!note
        Read [SedonaSQL Predicate API](../../../api/flink/Predicate) to learn 
different spatial query predicates.
        
-### KNN query
+## KNN query
 
 Use ==ST_Distance== to calculate the distance and rank the distance.
 
@@ -207,6 +204,89 @@ geomTable = tableEnv.sqlQuery(
 geomTable.execute().print()
 ```
 
+## Join query
+
+This equi-join leverages Flink's internal equi-join algorithm. You can opt to 
skip the Sedona refinement step by sacrificing query accuracy.
+
+Please use the following steps:
+
+### 1. Generate S2 ids for both tables
+
+Use [ST_S2CellIds](../../../api/flink/Function/#st_s2cellids) to generate cell 
IDs. Each geometry may produce one or more IDs.
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM lefts
+```
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM rights
+```
+
+### 2. Perform equi-join
+
+Join the two tables by their S2 cellId
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as 
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
+```
+
+
+### 3. Optional: Refine the result
+
+Due to the nature of S2 Cellid, the equi-join results might have a few 
false-positives depending on the S2 level you choose. A smaller level indicates 
bigger cells, less exploded rows, but more false positives.
+
+To ensure correctness, you can use one of the [Spatial 
Predicates](../../../api/Predicate/) to filter them out. Use this query instead 
of the query in Step 2.
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as 
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs, rcs
+WHERE lcs.cellId = rcs.cellId AND ST_Contains(lcs.geom, rcs.geom)
+```
+
+As you see, compared to the query in Step 2, we added one more filter, which 
is `ST_Contains`, to remove false positives. You can also use `ST_Intersects` 
and so on.
+
+!!!tip
+       You can skip this step if you don't need 100% accuracy and want faster 
query speed.
+
+### 4. Optional: De-duplicate
+
+Due to the explode function used when we generate S2 Cell Ids, the resulting 
DataFrame may have several duplicate <lcs_geom, rcs_geom> matches. You can 
remove them by performing a GroupBy query.
+
+```sql
+SELECT lcs_id, rcs_id, FIRST_VALUE(lcs_geom), FIRST_VALUE(lcs_name), 
FIRST_VALUE(rcs_geom), FIRST_VALUE(rcs_name)
+FROM joinresult
+GROUP BY (lcs_id, rcs_id)
+```
+
+The `FIRST_VALUE` function takes the first value from a set of duplicate 
values.
+
+If you don't have a unique id for each geometry, you can also group by 
geometry itself. See below:
+
+```sql
+SELECT lcs_geom, rcs_geom, FIRST_VALUE(lcs_name), FIRST_VALUE(rcs_name)
+FROM joinresult
+GROUP BY (lcs_geom, rcs_geom)
+```
+
+!!!note
+       If you are doing point-in-polygon join, this is not a problem and you 
can safely discard this issue. This issue only happens when you do 
polygon-polygon, polygon-linestring, linestring-linestring join.
+
+### S2 for distance join
+
+This also works for distance join. You first need to use `ST_Buffer(geometry, 
distance)` to wrap one of your original geometry columns. If your original 
geometry column contains points, this `ST_Buffer` will turn them into circles 
with a radius of `distance`.
+
+For example, run this query first on the left table before Step 1.
+
+```sql
+SELECT id, ST_Buffer(geom, DISTANCE), name
+FROM lefts
+```
+
+Since the coordinates are in the longitude and latitude system, the unit of 
`distance` should be degrees instead of meters or miles. You will have to estimate 
the corresponding degrees based on your meter values. Please use [this 
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
+
 ## Convert Spatial Table to Spatial DataStream
 
 ### Get DataStream
diff --git a/docs/tutorial/geopandas-shapely.md 
b/docs/tutorial/geopandas-shapely.md
index 4b1251fa..96f9774b 100644
--- a/docs/tutorial/geopandas-shapely.md
+++ b/docs/tutorial/geopandas-shapely.md
@@ -1,5 +1,8 @@
 # Work with GeoPandas and Shapely
 
+!!!danger
+       Sedona Python currently only works with Shapely 1.x. If you use 
GeoPandas, please use <= GeoPandas `0.11.1`. GeoPandas > 0.11.1 will 
automatically install Shapely 2.0. If you use Shapely, please use <= `1.8.4`.
+
 ## Interoperate with GeoPandas
 
 Sedona Python has implemented serializers and deserializers which allows to 
convert Sedona Geometry objects into Shapely BaseGeometry objects. Based on 
that it is possible to load the data with geopandas from file (look at Fiona 
possible drivers) and create Spark DataFrame based on GeoDataFrame object.
diff --git a/docs/tutorial/rdd.md b/docs/tutorial/rdd.md
index 99880f38..077415dc 100644
--- a/docs/tutorial/rdd.md
+++ b/docs/tutorial/rdd.md
@@ -928,6 +928,8 @@ The index should be built on either one of two SpatialRDDs. 
In general, you shou
 
 A distance join query takes as input two Spatial RDD A and B and a distance. 
For each geometry in A, finds the geometries (from B) are within the given 
distance to it. A and B can be any geometry type and are not necessary to have 
the same geometry type. The unit of the distance is explained 
[here](#transform-the-coordinate-reference-system).
 
+If you don't want to transform your data and are ok with sacrificing the query 
accuracy, you can use an approximate degree value for distance. Please use 
[this 
calculator](https://lucidar.me/en/online-unit-converter-length-to-angle/convert-degrees-to-meters/#online-converter).
+
 Assume you now have two SpatialRDDs (typed or generic). You can use the 
following code to issue an Distance Join Query on them.
 
 === "Scala"
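
Note on the degree estimate mentioned in the Optimizer.md, flink/sql.md and rdd.md hunks above: those hunks point to an online calculator for turning a meter distance into an approximate degree value. The same estimate can be sketched in a few lines of Python. This is only an illustration and not part of the commit or of Sedona. It assumes a spherical Earth with a mean radius of 6,371,008.8 m, and the function name `meters_to_degrees` is hypothetical.

```python
import math

# Assumption: spherical Earth with the mean radius below (not from the commit).
EARTH_RADIUS_M = 6_371_008.8


def meters_to_degrees(meters, latitude_deg=0.0):
    """Return (latitude_degrees, longitude_degrees) roughly spanning `meters`.

    One degree of latitude covers about the same ground distance everywhere,
    while one degree of longitude shrinks by cos(latitude) away from the equator.
    """
    if abs(latitude_deg) >= 90.0:
        raise ValueError("latitude must be strictly between -90 and 90 degrees")
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    lat_degrees = meters / meters_per_degree
    lon_degrees = lat_degrees / math.cos(math.radians(latitude_deg))
    return lat_degrees, lon_degrees


# Example: estimate a degree distance for a 1 km join around latitude 40.
lat_deg, lon_deg = meters_to_degrees(1000.0, latitude_deg=40.0)
print(lat_deg, lon_deg)
```

Under this model, 1000 m is roughly 0.009 degrees of latitude. Because the longitude figure grows with latitude, a single degree value passed as `distance` or to `ST_Buffer` is only an approximation, which matches the accuracy caveat in the hunks.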
