This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch prepare-1.4.0-doc
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/prepare-1.4.0-doc by this push:
new d1b9c92a Update all tutorials and docs
d1b9c92a is described below
commit d1b9c92abf9b164f46eb64d72778049cc364dda9
Author: Jia Yu <[email protected]>
AuthorDate: Wed Mar 15 14:44:31 2023 -0700
Update all tutorials and docs
---
docs/api/sql/Optimizer.md | 91 +++++++++++++++++++++++++++++++++++---
docs/setup/compile.md | 8 ++--
docs/setup/release-notes.md | 8 ++--
docs/tutorial/jupyter-notebook.md | 2 +-
docs/tutorial/python-vector-osm.md | 2 +-
docs/tutorial/sql-pure-sql.md | 2 +-
mkdocs.yml | 5 +--
7 files changed, 98 insertions(+), 20 deletions(-)
diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index 7034e44c..2019d51a 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -1,9 +1,11 @@
-# SedonaSQL query optimizer
Sedona Spatial operators fully support the Apache SparkSQL query optimizer. It
has the following query optimization features:
* Automatically optimizes range join query and distance join query.
* Automatically performs predicate pushdown.
+!!! tip
+ Sedona join performance is heavily affected by the number of
partitions. If the join performance is not ideal, please increase the number of
partitions by doing `df.repartition(XXX)` right after you create the original
DataFrame.
+
## Range join
Introduction: Find geometries from A and geometries from B such that each
geometry pair satisfies a certain predicate. Most predicates supported by
SedonaSQL can trigger a range join.
@@ -72,9 +74,12 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33:
geometry, 2.0, true
!!!warning
    Sedona doesn't control the distance's unit (degree or meter). It is the
same as the unit of the geometry's coordinates. To change the geometry's unit,
please transform the coordinate reference system. See
[ST_Transform](Function.md#st_transform).
-## Broadcast join
-Introduction: Perform a range join or distance join but broadcast one of the
sides of the join.
-This maintains the partitioning of the non-broadcast side and doesn't require
a shuffle.
+## Broadcast index join
+
+Introduction: Perform a range join or distance join but broadcast one of the
sides of the join. This maintains the partitioning of the non-broadcast side
and doesn't require a shuffle.
+
+Sedona will create a spatial index on the broadcasted table.
+
Sedona uses a broadcast join only if the correct side has a broadcast hint.
The supported join type and broadcast side combinations are:
@@ -118,9 +123,81 @@ BroadcastIndexJoin pointshape#52: geometry, BuildRight,
BuildLeft, true, 2.0 ST_
Note: If the distance is an expression, it is only evaluated on the first
argument to ST_Distance (`pointDf1` above).
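
To illustrate the broadcast index join idea outside of Spark, here is a minimal
plain-Python sketch: the small side is copied ("broadcast") and indexed once,
and the large side probes the index locally, so no shuffle is needed. The grid
index and sample points are hypothetical illustrations, not Sedona's actual
data structures.

```python
# Sketch of a broadcast index join: index the small (broadcast) side
# once, then probe it for each row of the large side.
# The grid index and sample points are hypothetical illustrations.

def build_grid_index(points, cell=1.0):
    """Bucket points by the grid cell containing them."""
    index = {}
    for pid, (x, y) in points:
        key = (int(x // cell), int(y // cell))
        index.setdefault(key, []).append((pid, (x, y)))
    return index

def probe(index, x, y, radius=0.5, cell=1.0):
    """Return ids of indexed points within `radius` of (x, y)."""
    cx, cy = int(x // cell), int(y // cell)
    matches = []
    # radius < cell, so checking the 3x3 neighborhood is enough
    for kx in (cx - 1, cx, cx + 1):
        for ky in (cy - 1, cy, cy + 1):
            for pid, (px, py) in index.get((kx, ky), []):
                if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    matches.append(pid)
    return matches

small_side = [("a", (0.2, 0.2)), ("b", (3.0, 3.0))]
idx = build_grid_index(small_side)  # built once, shipped to every worker
print(probe(idx, 0.4, 0.4))         # -> ['a']
```

In Sedona itself the index is a real spatial index on the broadcast table, not
a toy grid, but the shuffle-free probe pattern is the same.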
-## Predicate pushdown
+## Automatic broadcast index join
+
+When one table involved in a spatial join query is smaller than a threshold,
Sedona will automatically choose the broadcast index join instead of the Sedona
optimized join. The threshold is controlled by
[sedona.join.autoBroadcastJoinThreshold](../Parameter) and defaults to the same
value as `spark.sql.autoBroadcastJoinThreshold`.
+
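
The threshold can be adjusted at runtime, for example (the value below is
illustrative, in bytes; when the setting is read may depend on your Sedona
version, so setting it at session creation is the safest choice):

```sql
-- Illustrative: allow tables up to ~10 MB to be broadcast
SET sedona.join.autoBroadcastJoinThreshold = 10485760;
```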
+## Google S2 based equi-join
+
+If the performance of the Sedona optimized join is not ideal, which is
possibly caused by complicated and overlapping geometries, you can resort to
Sedona's built-in Google S2-based equi-join. This equi-join leverages Spark's
internal equi-join algorithm and might be more performant in some cases, given
that the refinement step is optional.
+
+Please use the following steps:
+
+### 1. Generate S2 ids for both tables
+
+Use [ST_S2CellIds](../Function/#st_s2cellids) to generate cell IDs. Each
geometry may produce one or more IDs.
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM lefts
+```
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM rights
+```
+
+### 2. Perform equi-join
+
+Join the two tables by their S2 cellId.
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
+```
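
Steps 1 and 2 can be mimicked in plain Python, with a hypothetical square grid
standing in for S2 cells and bounding boxes standing in for geometries:

```python
# Toy version of ST_S2CellIDs + explode + equi-join: each bounding box
# maps to the grid cells it overlaps, and the join matches rows that
# share a cell id. The square grid is a stand-in for real S2 cells.

def cell_ids(bbox, cell=1.0):
    """All grid cells overlapped by (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    return {(i, j)
            for i in range(int(xmin // cell), int(xmax // cell) + 1)
            for j in range(int(ymin // cell), int(ymax // cell) + 1)}

lefts  = [("l1", (0.1, 0.1, 0.9, 0.9)), ("l2", (2.1, 2.1, 2.9, 2.9))]
rights = [("r1", (0.5, 0.5, 1.5, 1.5)), ("r2", (5.0, 5.0, 6.0, 6.0))]

# "explode": one (id, cellId) row per cell a geometry touches
lcs = [(lid, c) for lid, bbox in lefts for c in cell_ids(bbox)]
rcs = [(rid, c) for rid, bbox in rights for c in cell_ids(bbox)]

# equi-join on cellId (candidate pairs; may contain false positives)
joined = {(lid, rid) for lid, lc in lcs for rid, rc in rcs if lc == rc}
print(sorted(joined))  # -> [('l1', 'r1')]
```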
+
+
+### 3. Optional: Refine the result
+
+Due to the nature of S2 cell IDs, the equi-join results might contain a few
false positives, depending on the S2 level you choose. A smaller level means
bigger cells and fewer exploded rows, but more false positives.
+
+To ensure correctness, you can use a [Spatial Predicate](../Predicate/) to
filter them out.
+
+```sql
+SELECT *
+FROM joinresult
+WHERE ST_Contains(lcs.geom, rcs.geom)
+```
+
+!!!tip
+ You can skip this step if you don't need 100% accuracy and want faster
query speed.
+
+### 4. Optional: De-duplicate
+
+Due to the explode function used when generating S2 cell IDs, the resulting
DataFrame may contain several duplicate <lcs_geom, rcs_geom> matches. You can
remove them by performing a GroupBy query.
+
+```sql
+SELECT lcs_id, rcs_id, first(lcs_geom), first(lcs_name), first(rcs_geom),
first(rcs_name)
+FROM joinresult
+GROUP BY (lcs_id, rcs_id)
+```
+
+The `first` function takes the first value from each group of duplicate
rows.
+
+If you don't have a unique id for each geometry, you can also group by the
geometry itself:
+
+```sql
+SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
+FROM joinresult
+GROUP BY (lcs_geom, rcs_geom)
+```
+
+!!!note
+    If you are doing a point-in-polygon join, this is not a problem and you
can safely ignore it. The issue only occurs for polygon-polygon,
polygon-linestring, and linestring-linestring joins.
+
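
Continuing the plain-Python analogy, steps 3 and 4 refine the candidate pairs
with an exact predicate and then keep one row per pair, mirroring the
`first`-based GROUP BY above. The containment test and candidate rows below are
hypothetical stand-ins, not Sedona code:

```python
# Refine (step 3) and de-duplicate (step 4) candidate pairs from a
# cell-id equi-join. Bounding-box containment stands in for
# ST_Contains; the candidate rows are hypothetical.

def contains(outer, inner):
    """True if box `outer` fully contains box `inner`."""
    oxmin, oymin, oxmax, oymax = outer
    ixmin, iymin, ixmax, iymax = inner
    return (oxmin <= ixmin and oymin <= iymin
            and ixmax <= oxmax and iymax <= oymax)

candidates = [  # (lcs_id, lcs_geom, rcs_id, rcs_geom)
    ("l1", (0.0, 0.0, 2.0, 2.0), "r1", (0.5, 0.5, 1.5, 1.5)),
    ("l1", (0.0, 0.0, 2.0, 2.0), "r1", (0.5, 0.5, 1.5, 1.5)),  # dup: shared cells
    ("l1", (0.0, 0.0, 2.0, 2.0), "r2", (1.5, 1.5, 3.0, 3.0)),  # false positive
]

# Step 3: keep only pairs where the predicate really holds
refined = [row for row in candidates if contains(row[1], row[3])]

# Step 4: GROUP BY (lcs_id, rcs_id), keeping the first row per group
deduped = {}
for lcs_id, lcs_geom, rcs_id, rcs_geom in refined:
    deduped.setdefault((lcs_id, rcs_id), (lcs_geom, rcs_geom))  # first()
print(sorted(deduped))  # -> [('l1', 'r1')]
```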
-Introduction: Given a join query and a predicate in the same WHERE clause,
first executes the Predicate as a filter, then executes the join query*
+## Regular spatial predicate pushdown
+Introduction: Given a join query and a predicate in the same WHERE clause,
Sedona first executes the predicate as a filter, then executes the join query.
Spark SQL Example:
@@ -143,7 +220,7 @@ RangeJoin polygonshape#20: geometry, pointshape#43:
geometry, false
+- *FileScan csv
```
-### GeoParquet
+## Push spatial predicates to GeoParquet
Sedona supports spatial predicate push-down for GeoParquet files. When spatial
filters are applied to DataFrames backed by GeoParquet files, Sedona will use
the
[`bbox` properties in the
metadata](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox)
diff --git a/docs/setup/compile.md b/docs/setup/compile.md
index 4680091a..775eaf8a 100644
--- a/docs/setup/compile.md
+++ b/docs/setup/compile.md
@@ -4,7 +4,7 @@
## Compile Scala / Java source code
-Sedona Scala/Java code is a project with four modules, core, sql, viz and
python adapter. Each module is a Scala/Java mixed project which is managed by
Apache Maven 3.
+Sedona Scala/Java code is a project with multiple modules. Each module is a
Scala/Java mixed project which is managed by Apache Maven 3.
* Make sure your Linux/Mac machine has Java 1.8, Apache Maven 3.3.1+, and
Python3. The compilation of Sedona is not tested on Windows machines.
@@ -43,7 +43,7 @@ To compile all modules, please make sure you are in the root
folder of all modul
```
!!!tip
- To get the Sedona Python-adapter jar with all GeoTools jars included,
simply append `-Dgeotools` option. The command is like this:`mvn clean install
-DskipTests -Dscala=2.12 -Dspark=3.0 -Dgeotools`
+ To get the Sedona Spark Shaded jar with all GeoTools jars included,
simply append `-Dgeotools` option. The command is like this:`mvn clean install
-DskipTests -Dscala=2.12 -Dspark=3.0 -Dgeotools`
### Download staged jars
@@ -58,9 +58,9 @@ For example,
export SPARK_HOME=$PWD/spark-3.0.1-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python
```
-2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the
==sedona-python-adapter-{{ sedona.current_version }}.jar== to
==SPARK_HOME/jars/== folder.
+2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the
==sedona-spark-shaded-{{ sedona.current_version }}.jar== to
==SPARK_HOME/jars/== folder.
```
-cp python-adapter/target/sedona-python-adapter-xxx.jar SPARK_HOME/jars/
+cp spark-shaded/target/sedona-spark-shaded-xxx.jar SPARK_HOME/jars/
```
3. Install the following libraries
```
diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md
index 715995a4..99179c5a 100644
--- a/docs/setup/release-notes.md
+++ b/docs/setup/release-notes.md
@@ -13,9 +13,13 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12,
Java 8.
* [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader.
* [X] **Sedona Spark** A number of bug fixes and improvement to the Sedona R
module.
+### API change
+
+* **Sedona Spark & Flink** Packaging strategy changed. See [Maven
Coordinate](../maven-coordinates). Please change your Sedona dependencies if
needed. We recommend `sedona-spark-shaded-3.0_2.12-1.4.0` and
`sedona-flink-shaded-3.0_2.12-1.4.0`
+* **Sedona Spark & Flink** GeoTools-wrapper version upgraded. Please use
`geotools-wrapper-1.4.0-28.2`.
+
### Behavior change
-* **Sedona Spark & Flink** Packaging strategy changed. See [Maven
Coordinate](../maven-coordinates). Please change your Sedona dependencies if
needed.
* **Sedona Flink** Sedona Flink no longer outputs any LinearRing type
geometry. All LinearRings are changed to LineStrings.
* **Sedona Spark** Join optimization strategy changed. Sedona no longer
optimizes a spatial join when a spatial predicate is used together with an
equi-join predicate. By default, it prefers the equi-join whenever possible.
SedonaConf adds a config option called `sedona.join.optimizationmode`; it can
be configured as one of the following values:
    * `all`: optimize all joins having a spatial predicate in their join
conditions. This was the behavior of Apache Sedona prior to 1.4.0.
@@ -25,8 +29,6 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12,
Java 8.
When `sedona.join.optimizationmode` is configured as `nonequi`, it won't
optimize join queries such as `SELECT * FROM A, B WHERE A.x = B.x AND
ST_Contains(A.geom, B.geom)`, since it is an equi-join with equi-condition `A.x
= B.x`. Sedona will optimize `SELECT * FROM A, B WHERE ST_Contains(A.geom,
B.geom)`.
-
-
### Bug
<ul>
diff --git a/docs/tutorial/jupyter-notebook.md
b/docs/tutorial/jupyter-notebook.md
index 318c073c..bc0a91e2 100644
--- a/docs/tutorial/jupyter-notebook.md
+++ b/docs/tutorial/jupyter-notebook.md
@@ -9,7 +9,7 @@ Please use the following steps to run Jupyter notebook with
Pipenv on your machi
1. Clone Sedona GitHub repo or download the source code
2. Install Sedona Python from PyPI or GitHub source: Read [Install Sedona
Python](../../setup/install-python/#install-sedona) to learn.
-3. Prepare python-adapter jar: Read [Install Sedona
Python](../../setup/install-python/#prepare-python-adapter-jar) to learn.
+3. Prepare spark-shaded jar: Read [Install Sedona
Python](../../setup/install-python/#prepare-spark-shaded-jar) to learn.
4. Setup pipenv python version. Please use your desired Python version.
```bash
cd binder
diff --git a/docs/tutorial/python-vector-osm.md
b/docs/tutorial/python-vector-osm.md
index 23d10d70..0e20414a 100644
--- a/docs/tutorial/python-vector-osm.md
+++ b/docs/tutorial/python-vector-osm.md
@@ -40,7 +40,7 @@ spark = SparkSession.\
config('spark.kryoserializer.buffer.max', 2047).\
config("spark.serializer", KryoSerializer.getName).\
config("spark.kryo.registrator", SedonaKryoRegistrator.getName).\
- config("spark.jars.packages",
"org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2")
.\
+ config("spark.jars.packages",
"org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2")
.\
enableHiveSupport().\
getOrCreate()
diff --git a/docs/tutorial/sql-pure-sql.md b/docs/tutorial/sql-pure-sql.md
index cdc6097c..cbc1a8a8 100644
--- a/docs/tutorial/sql-pure-sql.md
+++ b/docs/tutorial/sql-pure-sql.md
@@ -8,7 +8,7 @@ SedonaSQL supports SQL/MM Part3 Spatial SQL Standard. Detailed
SedonaSQL APIs ar
Start `spark-sql` as follows (replace `<VERSION>` with the actual version,
e.g., `1.0.1-incubating`):
```sh
-spark-sql --packages
org.apache.sedona:sedona-python-adapter-3.0_2.12:<VERSION>,org.apache.sedona:sedona-viz-3.0_2.12:<VERSION>,org.datasyslab:geotools-wrapper:geotools-24.0
\
+spark-sql --packages
org.apache.sedona:sedona-spark-shaded-3.0_2.12:<VERSION>,org.apache.sedona:sedona-viz-3.0_2.12:<VERSION>,org.datasyslab:geotools-wrapper:geotools-24.0
\
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf
spark.kryo.registrator=org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
\
--conf
spark.sql.extensions=org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
diff --git a/mkdocs.yml b/mkdocs.yml
index eb634c97..2e41e4b2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,7 +54,7 @@ nav:
- Predicate: api/sql/Predicate.md
- Aggregate function: api/sql/AggregateFunction.md
- DataFrame Style functions: api/sql/DataFrameAPI.md
- - SedonaSQL query optimizer: api/sql/Optimizer.md
+ - Query optimization: api/sql/Optimizer.md
- Raster data:
- Raster loader: api/sql/Raster-loader.md
- Raster writer: api/sql/Raster-writer.md
@@ -62,11 +62,10 @@ nav:
- Parameter: api/sql/Parameter.md
- RDD (core):
- Scala/Java doc: api/java-api.md
- - Python doc: api/python-api.md
- - R doc: api/r-api.md
- Viz:
- DataFrame/SQL: api/viz/sql.md
- RDD: api/viz/java-api.md
+ - Sedona R: api/rdocs
- Sedona with Apache Flink:
- SQL:
- Overview: api/flink/Overview.md