This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch prepare-1.4.0-doc
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/prepare-1.4.0-doc by this push:
new d1b9c92a Update all tutorials and docs
d1b9c92a is described below
commit d1b9c92abf9b164f46eb64d72778049cc364dda9
Author: Jia Yu <[email protected]>
AuthorDate: Wed Mar 15 14:44:31 2023 -0700
Update all tutorials and docs
---
docs/api/sql/Optimizer.md | 91 +++++++++++++++++++++++++++++++++++---
docs/setup/compile.md | 8 ++--
docs/setup/release-notes.md | 8 ++--
docs/tutorial/jupyter-notebook.md | 2 +-
docs/tutorial/python-vector-osm.md | 2 +-
docs/tutorial/sql-pure-sql.md | 2 +-
mkdocs.yml | 5 +--
7 files changed, 98 insertions(+), 20 deletions(-)
diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index 7034e44c..2019d51a 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -1,9 +1,11 @@
-# SedonaSQL query optimizer
Sedona Spatial operators fully support the Apache SparkSQL query optimizer. It
has the following query optimization features:
* Automatically optimizes range join query and distance join query.
* Automatically performs predicate pushdown.
+!!! tip
+ Sedona join performance is heavily affected by the number of
partitions. If the join performance is not ideal, please increase the number of
partitions by doing `df.repartition(XXX)` right after you create the original
DataFrame.
+
## Range join
Introduction: Find geometries from A and geometries from B such that each
geometry pair satisfies a certain predicate. Most predicates supported by
SedonaSQL can trigger a range join.
@@ -72,9 +74,12 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33:
geometry, 2.0, true
!!!warning
    Sedona doesn't control the distance's unit (degree or meter). It is the
same as the unit of the geometry's coordinates. To change the geometry's unit,
please transform the coordinate reference system. See
[ST_Transform](Function.md#st_transform).
-## Broadcast join
-Introduction: Perform a range join or distance join but broadcast one of the
sides of the join.
-This maintains the partitioning of the non-broadcast side and doesn't require
a shuffle.
+## Broadcast index join
+
+Introduction: Perform a range join or distance join but broadcast one of the
sides of the join. This maintains the partitioning of the non-broadcast side
and doesn't require a shuffle.
+
+Sedona will create a spatial index on the broadcasted table.
+
Sedona uses a broadcast join only if the correct side has a broadcast hint.
The supported join type and broadcast side combinations are:
@@ -118,9 +123,81 @@ BroadcastIndexJoin pointshape#52: geometry, BuildRight,
BuildLeft, true, 2.0 ST_
Note: If the distance is an expression, it is only evaluated on the first
argument to ST_Distance (`pointDf1` above).
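
To illustrate the broadcast index join idea outside of Spark, here is a minimal
plain-Python sketch: the small side is copied ("broadcast") and indexed once,
and the large side probes the index locally, so no shuffle is needed. The grid
index and sample points are hypothetical illustrations, not Sedona's actual
data structures.

```python
# Sketch of a broadcast index join: index the small (broadcast) side
# once, then probe it for each row of the large side.
# The grid index and sample points are hypothetical illustrations.

def build_grid_index(points, cell=1.0):
    """Bucket points by the grid cell containing them."""
    index = {}
    for pid, (x, y) in points:
        key = (int(x // cell), int(y // cell))
        index.setdefault(key, []).append((pid, (x, y)))
    return index

def probe(index, x, y, radius=0.5, cell=1.0):
    """Return ids of indexed points within `radius` of (x, y)."""
    cx, cy = int(x // cell), int(y // cell)
    matches = []
    # radius < cell, so checking the 3x3 neighborhood is enough
    for kx in (cx - 1, cx, cx + 1):
        for ky in (cy - 1, cy, cy + 1):
            for pid, (px, py) in index.get((kx, ky), []):
                if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                    matches.append(pid)
    return matches

small_side = [("a", (0.2, 0.2)), ("b", (3.0, 3.0))]
idx = build_grid_index(small_side)  # built once, shipped to every worker
print(probe(idx, 0.4, 0.4))         # -> ['a']
```

In Sedona itself the index is a real spatial index on the broadcast table, not
a toy grid, but the shuffle-free probe pattern is the same.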
-## Predicate pushdown
+## Automatic broadcast index join
+
+When one table involved in a spatial join query is smaller than a threshold,
Sedona will automatically choose the broadcast index join instead of the Sedona
optimized join. The threshold is controlled by
[sedona.join.autoBroadcastJoinThreshold](../Parameter) and defaults to the same
value as `spark.sql.autoBroadcastJoinThreshold`.
+
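
The threshold can be adjusted at runtime, for example (the value below is
illustrative, in bytes; when the setting is read may depend on your Sedona
version, so setting it at session creation is the safest choice):

```sql
-- Illustrative: allow tables up to ~10 MB to be broadcast
SET sedona.join.autoBroadcastJoinThreshold = 10485760;
```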
+## Google S2 based equi-join
+
+If the performance of the Sedona optimized join is not ideal, which is
possibly caused by complicated and overlapping geometries, you can resort to
Sedona's built-in Google S2-based equi-join. This equi-join leverages Spark's
internal equi-join algorithm and might be more performant in some cases, given
that the refinement step is optional.
+
+Please use the following steps:
+
+### 1. Generate S2 ids for both tables
+
+Use [ST_S2CellIds](../Function/#st_s2cellids) to generate cell IDs. Each
geometry may produce one or more IDs.
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM lefts
+```
+
+```sql
+SELECT id, geom, name, explode(ST_S2CellIDs(geom, 15)) as cellId
+FROM rights
+```
+
+### 2. Perform equi-join
+
+Join the two tables by their S2 cellId.
+
+```sql
+SELECT lcs.id as lcs_id, lcs.geom as lcs_geom, lcs.name as lcs_name, rcs.id as
rcs_id, rcs.geom as rcs_geom, rcs.name as rcs_name
+FROM lcs JOIN rcs ON lcs.cellId = rcs.cellId
+```
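
Steps 1 and 2 can be mimicked in plain Python, with a hypothetical square grid
standing in for S2 cells and bounding boxes standing in for geometries:

```python
# Toy version of ST_S2CellIDs + explode + equi-join: each bounding box
# maps to the grid cells it overlaps, and the join matches rows that
# share a cell id. The square grid is a stand-in for real S2 cells.

def cell_ids(bbox, cell=1.0):
    """All grid cells overlapped by (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    return {(i, j)
            for i in range(int(xmin // cell), int(xmax // cell) + 1)
            for j in range(int(ymin // cell), int(ymax // cell) + 1)}

lefts  = [("l1", (0.1, 0.1, 0.9, 0.9)), ("l2", (2.1, 2.1, 2.9, 2.9))]
rights = [("r1", (0.5, 0.5, 1.5, 1.5)), ("r2", (5.0, 5.0, 6.0, 6.0))]

# "explode": one (id, cellId) row per cell a geometry touches
lcs = [(lid, c) for lid, bbox in lefts for c in cell_ids(bbox)]
rcs = [(rid, c) for rid, bbox in rights for c in cell_ids(bbox)]

# equi-join on cellId (candidate pairs; may contain false positives)
joined = {(lid, rid) for lid, lc in lcs for rid, rc in rcs if lc == rc}
print(sorted(joined))  # -> [('l1', 'r1')]
```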
+
+
+### 3. Optional: Refine the result
+
+Due to the nature of S2 cell IDs, the equi-join results might contain a few
false positives, depending on the S2 level you choose. A smaller level means
bigger cells and fewer exploded rows, but more false positives.
+
+To ensure correctness, you can use a [Spatial Predicate](../Predicate/) to
filter them out.
+
+```sql
+SELECT *
+FROM joinresult
+WHERE ST_Contains(lcs.geom, rcs.geom)
+```
+
+!!!tip
+ You can skip this step if you don't need 100% accuracy and want faster
query speed.
+
+### 4. Optional: De-duplicate
+
+Due to the explode function used when generating S2 cell IDs, the resulting
DataFrame may contain several duplicate <lcs_geom, rcs_geom> matches. You can
remove them by performing a GroupBy query.
+
+```sql
+SELECT lcs_id, rcs_id, first(lcs_geom), first(lcs_name), first(rcs_geom),
first(rcs_name)
+FROM joinresult
+GROUP BY (lcs_id, rcs_id)
+```
+
+The `first` function takes the first value from each group of duplicate
rows.
+
+If you don't have a unique id for each geometry, you can also group by the
geometry itself:
+
+```sql
+SELECT lcs_geom, rcs_geom, first(lcs_name), first(rcs_name)
+FROM joinresult
+GROUP BY (lcs_geom, rcs_geom)
+```
+
+!!!note
+    If you are doing a point-in-polygon join, this is not a problem and you
can safely ignore it. The issue only occurs for polygon-polygon,
polygon-linestring, and linestring-linestring joins.
+
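
Continuing the plain-Python analogy, steps 3 and 4 refine the candidate pairs
with an exact predicate and then keep one row per pair, mirroring the
`first`-based GROUP BY above. The containment test and candidate rows below are
hypothetical stand-ins, not Sedona code:

```python
# Refine (step 3) and de-duplicate (step 4) candidate pairs from a
# cell-id equi-join. Bounding-box containment stands in for
# ST_Contains; the candidate rows are hypothetical.

def contains(outer, inner):
    """True if box `outer` fully contains box `inner`."""
    oxmin, oymin, oxmax, oymax = outer
    ixmin, iymin, ixmax, iymax = inner
    return (oxmin <= ixmin and oymin <= iymin
            and ixmax <= oxmax and iymax <= oymax)

candidates = [  # (lcs_id, lcs_geom, rcs_id, rcs_geom)
    ("l1", (0.0, 0.0, 2.0, 2.0), "r1", (0.5, 0.5, 1.5, 1.5)),
    ("l1", (0.0, 0.0, 2.0, 2.0), "r1", (0.5, 0.5, 1.5, 1.5)),  # dup: shared cells
    ("l1", (0.0, 0.0, 2.0, 2.0), "r2", (1.5, 1.5, 3.0, 3.0)),  # false positive
]

# Step 3: keep only pairs where the predicate really holds
refined = [row for row in candidates if contains(row[1], row[3])]

# Step 4: GROUP BY (lcs_id, rcs_id), keeping the first row per group
deduped = {}
for lcs_id, lcs_geom, rcs_id, rcs_geom in refined:
    deduped.setdefault((lcs_id, rcs_id), (lcs_geom, rcs_geom))  # first()
print(sorted(deduped))  # -> [('l1', 'r1')]
```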
-Introduction: Given a join query and a predicate in the same WHERE clause,
first executes the Predicate as a filter, then executes the join query*
+## Regular spatial predicate pushdown
+Introduction: Given a join query and a predicate in the same WHERE clause,
Sedona first executes the predicate as a filter, then executes the join query.
Spark SQL Example:
@@ -143,7 +220,7 @@ RangeJoin polygonshape#20: geometry, pointshape#43:
geometry, false
+- *FileScan csv
```
-### GeoParquet
+## Push spatial predicates to GeoParquet
Sedona supports spatial predicate push-down for GeoParquet files. When spatial
filters are applied to DataFrames backed by GeoParquet files, Sedona will use
the
[`bbox` properties in the
metadata](https://github.com/opengeospatial/geoparquet/blob/v1.0.0-beta.1/format-specs/geoparquet.md#bbox)
diff --git a/docs/setup/compile.md b/docs/setup/compile.md
index 4680091a..775eaf8a 100644
--- a/docs/setup/compile.md
+++ b/docs/setup/compile.md
@@ -4,7 +4,7 @@
## Compile Scala / Java source code
-Sedona Scala/Java code is a project with four modules, core, sql, viz and
python adapter. Each module is a Scala/Java mixed project which is managed by
Apache Maven 3.
+Sedona Scala/Java code is a project with multiple modules. Each module is a
Scala/Java mixed project which is managed by Apache Maven 3.
* Make sure your Linux/Mac machine has Java 1.8, Apache Maven 3.3.1+, and
Python3. The compilation of Sedona is not tested on Windows machines.
@@ -43,7 +43,7 @@ To compile all modules, please make sure you are in the root
folder of all modul
```
!!!tip
- To get the Sedona Python-adapter jar with all GeoTools jars included,
simply append `-Dgeotools` option. The command is like this:`mvn clean install
-DskipTests -Dscala=2.12 -Dspark=3.0 -Dgeotools`
+ To get the Sedona Spark Shaded jar with all GeoTools jars included,
simply append `-Dgeotools` option. The command is like this:`mvn clean install
-DskipTests -Dscala=2.12 -Dspark=3.0 -Dgeotools`
### Download staged jars
@@ -58,9 +58,9 @@ For example,
export SPARK_HOME=$PWD/spark-3.0.1-bin-hadoop2.7
export PYTHONPATH=$SPARK_HOME/python
```
-2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the
==sedona-python-adapter-{{ sedona.current_version }}.jar== to
==SPARK_HOME/jars/== folder.
+2. Compile the Sedona Scala and Java code with `-Dgeotools` and then copy the
==sedona-spark-shaded-{{ sedona.current_version }}.jar== to
==SPARK_HOME/jars/== folder.
```
-cp python-adapter/target/sedona-python-adapter-xxx.jar SPARK_HOME/jars/
+cp spark-shaded/target/sedona-spark-shaded-xxx.jar SPARK_HOME/jars/
```
3. Install the following libraries
```
diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md
index 715995a4..99179c5a 100644
--- a/docs/setup/release-notes.md
+++ b/docs/setup/release-notes.md
@@ -13,9 +13,13 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12,
Java 8.
* [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader.
* [X] **Sedona Spark** A number of bug fixes and improvement to the Sedona R
module.
+### API change
+
+* **Sedona Spark & Flink** Packaging strategy changed. See [Maven
Coordinate](../maven-coordinates). Please change your Sedona dependencies if
needed. We recommend `sedona-spark-shaded-3.0_2.12-1.4.0` and
`sedona-flink-shaded-3.0_2.12-1.4.0`
+* **Sedona Spark & Flink** GeoTools-wrapper version upgraded. Please use
`geotools-wrapper-1.4.0-28.2`.
+
### Behavior change
-* **Sedona Spark & Flink** Packaging strategy changed. See [Maven
Coordinate](../maven-coordinates). Please change your Sedona dependencies if
needed.
* **Sedona Flink** Sedona Flink no longer outputs any LinearRing type
geometry. All LinearRings are changed to LineStrings.
* **Sedona Spark** Join optimization strategy changed. Sedona no longer
optimizes a spatial join when a spatial predicate is used together with an
equi-join predicate. By default, it prefers the equi-join whenever possible.
SedonaConf adds a config option called `sedona.join.optimizationmode`; it can
be configured as one of the following values:
    * `all`: optimize all joins having a spatial predicate in their join
conditions. This was the behavior of Apache Sedona prior to 1.4.0.
@@ -25,8 +29,6 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12,
Java 8.
When `sedona.join.optimizationmode` is configured as `nonequi`, it won't
optimize join queries such as `SELECT * FROM A, B WHERE A.x = B.x AND
ST_Contains(A.geom, B.geom)`, since it is an equi-join with equi-condition `A.x
= B.x`. Sedona will optimize `SELECT * FROM A, B WHERE ST_Contains(A.geom,
B.geom)`.
-
-
### Bug
<ul>
diff --git a/docs/tutorial/jupyter-notebook.md
b/docs/tutorial/jupyter-notebook.md
index 318c073c..bc0a91e2 100644
--- a/docs/tutorial/jupyter-notebook.md
+++ b/docs/tutorial/jupyter-notebook.md
@@ -9,7 +9,7 @@ Please use the following steps to run Jupyter notebook with
Pipenv on your machi
1. Clone Sedona GitHub repo or download the source code
2. Install Sedona Python from PyPI or GitHub source: Read [Install Sedona
Python](../../setup/install-python/#install-sedona) to learn.
-3. Prepare python-adapter jar: Read [Install Sedona
Python](../../setup/install-python/#prepare-python-adapter-jar) to learn.
+3. Prepare spark-shaded jar: Read [Install Sedona
Python](../../setup/install-python/#prepare-spark-shaded-jar) to learn.
4. Setup pipenv python version. Please use your desired Python version.
```bash
cd binder
diff --git a/docs/tutorial/python-vector-osm.md
b/docs/tutorial/python-vector-osm.md
index 23d10d70..0e20414a 100644
--- a/docs/tutorial/python-vector-osm.md
+++ b/docs/tutorial/python-vector-osm.md
@@ -40,7 +40,7 @@ spark = SparkSession.\
config('spark.kryoserializer.buffer.max', 2047).\
config("spark.serializer", KryoSerializer.getName).\
config("spark.kryo.registrator", SedonaKryoRegistrator.getName).\
- config("spark.jars.packages",
"org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,org.datasyslab:geotools-wrapper:1.1.0-25.2")
.\
+ config("spark.jars.packages",
"org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2")
.\
enableHiveSupport().\
getOrCreate()
diff --git a/docs/tutorial/sql-pure-sql.md b/docs/tutorial/sql-pure-sql.md
index cdc6097c..cbc1a8a8 100644
--- a/docs/tutorial/sql-pure-sql.md
+++ b/docs/tutorial/sql-pure-sql.md
@@ -8,7 +8,7 @@ SedonaSQL supports SQL/MM Part3 Spatial SQL Standard. Detailed
SedonaSQL APIs ar
Start `spark-sql` as follows (replace `<VERSION>` with the actual version,
e.g., `1.0.1-incubating`):
```sh
-spark-sql --packages
org.apache.sedona:sedona-python-adapter-3.0_2.12:<VERSION>,org.apache.sedona:sedona-viz-3.0_2.12:<VERSION>,org.datasyslab:geotools-wrapper:geotools-24.0
\
+spark-sql --packages
org.apache.sedona:sedona-spark-shaded-3.0_2.12:<VERSION>,org.apache.sedona:sedona-viz-3.0_2.12:<VERSION>,org.datasyslab:geotools-wrapper:geotools-24.0
\
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf
spark.kryo.registrator=org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
\
--conf
spark.sql.extensions=org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
diff --git a/mkdocs.yml b/mkdocs.yml
index eb634c97..2e41e4b2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -54,7 +54,7 @@ nav:
- Predicate: api/sql/Predicate.md
- Aggregate function: api/sql/AggregateFunction.md
- DataFrame Style functions: api/sql/DataFrameAPI.md
- - SedonaSQL query optimizer: api/sql/Optimizer.md
+ - Query optimization: api/sql/Optimizer.md
- Raster data:
- Raster loader: api/sql/Raster-loader.md
- Raster writer: api/sql/Raster-writer.md
@@ -62,11 +62,10 @@ nav:
- Parameter: api/sql/Parameter.md
- RDD (core):
- Scala/Java doc: api/java-api.md
- - Python doc: api/python-api.md
- - R doc: api/r-api.md
- Viz:
- DataFrame/SQL: api/viz/sql.md
- RDD: api/viz/java-api.md
+ - Sedona R: api/rdocs
- Sedona with Apache Flink:
- SQL:
- Overview: api/flink/Overview.md