This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new e9d1f6fbc3 [DOCS] Improve docs and fix deadlinks (#1846)
e9d1f6fbc3 is described below
commit e9d1f6fbc3979491464087efa6a5644f8c26a61c
Author: Jia Yu <[email protected]>
AuthorDate: Mon Mar 10 00:57:20 2025 -0700
[DOCS] Improve docs and fix deadlinks (#1846)
---
.github/linters/.markdown-lint.yml | 6 +
docs/api/sql/Function.md | 2 +-
docs/api/sql/Raster-map-algebra.md | 4 +-
docs/api/sql/Spider.md | 2 +-
docs/setup/databricks.md | 57 +----
docs/setup/docker.md | 4 +-
docs/setup/maven-coordinates.md | 46 ----
docs/setup/release-notes.md | 2 +-
docs/tutorial/files/geoparquet-sedona-spark.md | 10 +-
docs/tutorial/files/shapefiles-sedona-spark.md | 8 +-
.../files/stac-sedona-spark.md} | 142 +++++------
docs/tutorial/python-vector-osm.md | 159 ------------
docs/tutorial/raster.md | 83 +------
docs/tutorial/rdd.md | 4 +-
docs/tutorial/snowflake/sql.md | 2 +-
docs/tutorial/sql.md | 271 +++------------------
mkdocs.yml | 15 +-
17 files changed, 135 insertions(+), 682 deletions(-)
diff --git a/.github/linters/.markdown-lint.yml
b/.github/linters/.markdown-lint.yml
index 7c2ae00edc..9ce345ed0f 100644
--- a/.github/linters/.markdown-lint.yml
+++ b/.github/linters/.markdown-lint.yml
@@ -17,6 +17,9 @@
# https://github.com/DavidAnson/markdownlint#rules--aliases
+# ul-style Unordered list style
+MD004: false
+
# ul-indent - Unordered list indentation
MD007: false
@@ -55,3 +58,6 @@ MD041: false
# code-block-style - Code block style
MD046: false
+
+# link-fragments Link fragments should be valid
+MD051: false
diff --git a/docs/api/sql/Function.md b/docs/api/sql/Function.md
index 51f63f3a13..2dc28f21ce 100644
--- a/docs/api/sql/Function.md
+++ b/docs/api/sql/Function.md
@@ -460,7 +460,7 @@ POINT ZM(1 1 1 1)
## ST_AsGeoJSON
!!!note
- This method is not recommended. Please use [Sedona GeoJSON data
source](../../tutorial/sql.md#save-as-geojson) to write GeoJSON files.
+ This method is not recommended. Please use [Sedona GeoJSON data
source](../../tutorial/sql.md#save-geojson) to write GeoJSON files.
Introduction: Return the [GeoJSON](https://geojson.org/) string representation
of a geometry
diff --git a/docs/api/sql/Raster-map-algebra.md
b/docs/api/sql/Raster-map-algebra.md
index 5b22280ee6..70b7e52eef 100644
--- a/docs/api/sql/Raster-map-algebra.md
+++ b/docs/api/sql/Raster-map-algebra.md
@@ -34,7 +34,7 @@ RS_MapAlgebra(rast: Raster, pixelType: String, script:
String, [noDataValue: Dou
* `rast`: The raster to apply the map algebra expression to.
* `pixelType`: The data type of the output raster. This can be one of `D`
(double), `F` (float), `I` (integer), `S` (short), `US` (unsigned short) or `B`
(byte). If specified `NULL`, the output raster will have the same data type as
the input raster.
-* `script`: The map algebra script. [Refer here for more details on the
format.](#:~:text=The Jiffle script is,current output pixel value)
+* `script`: The map algebra script. [Refer here for more details on the
format.](https://github.com/geosolutions-it/jai-ext/wiki/Jiffle)
* `noDataValue`: (Optional) The nodata value of the output raster.
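For illustration, a single-raster call might look like the following sketch (the view name `raster_table`, the raster column `rast`, and the NDVI band indices are assumptions, not part of this page):

```python
# Hedged sketch: single-raster RS_MapAlgebra producing an NDVI band.
# Assumes a view `raster_table` with a raster column `rast` whose bands 0 and 3
# hold the red and near-infrared values; adjust the indices to your data.
ndvi_df = sedona.sql("""
    SELECT RS_MapAlgebra(rast, 'D', 'out = (rast[3] - rast[0]) / (rast[3] + rast[0]);') AS ndvi
    FROM raster_table
""")
ndvi_df.show()
```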
As of version `v1.5.1`, the `RS_MapAlgebra` function allows two raster column
inputs, with multi-band rasters supported. The function accepts 5 parameters:
@@ -46,7 +46,7 @@ RS_MapAlgebra(rast0: Raster, rast1: Raster, pixelType:
String, script: String, n
* `rast0`: The first raster to apply the map algebra expression to.
* `rast1`: The second raster to apply the map algebra expression to.
* `pixelType`: The data type of the output raster. This can be one of `D`
(double), `F` (float), `I` (integer), `S` (short), `US` (unsigned short) or `B`
(byte). If specified `NULL`, the output raster will have the same data type as
the input raster.
-* `script`: The map algebra script. [Refer here for more details on the
format.](#:~:text=The Jiffle script is,current output pixel value)
+* `script`: The map algebra script. [Refer here for more details on the
format.](https://github.com/geosolutions-it/jai-ext/wiki/Jiffle)
* `noDataValue`: (Not optional) The nodata value of the output raster, `null`
is allowed.
Spark SQL Example for two raster input `RS_MapAlgebra`:
diff --git a/docs/api/sql/Spider.md b/docs/api/sql/Spider.md
index 5dfa6569fe..207259d10e 100644
--- a/docs/api/sql/Spider.md
+++ b/docs/api/sql/Spider.md
@@ -21,7 +21,7 @@ Sedona offers a spatial data generator called Spider. It is a
data source that g
## Quick Start
-Once you have your [`SedonaContext` object created](../Overview#quick-start),
you can create a DataFrame with the `spider` data source.
+Once you have your [`SedonaContext` object created](Overview.md#quick-start),
you can create a DataFrame with the `spider` data source.
```python
df_random_points = sedona.read.format("spider").load(n=1000,
distribution="uniform")
diff --git a/docs/setup/databricks.md b/docs/setup/databricks.md
index 0a9e7cda9e..d0a8d332d9 100644
--- a/docs/setup/databricks.md
+++ b/docs/setup/databricks.md
@@ -17,60 +17,10 @@
under the License.
-->
-Please pay attention to the Spark version postfix and Scala version postfix on
our [Maven Coordinate page](maven-coordinates.md). Databricks Spark and Apache
Spark's compatibility can be found
[here](https://docs.databricks.com/en/release-notes/runtime/index.html).
-
-## Community edition (free-tier)
-
-You just need to install the Sedona jars and Sedona Python on Databricks using
Databricks default web UI. Then everything will work.
-
-### Install libraries
-
-1) From the Libraries tab install from Maven Coordinates
-
-```
-org.apache.sedona:sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}
-org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
-```
-
-2) For enabling python support, from the Libraries tab install from PyPI
-
-```
-apache-sedona=={{ sedona.current_version }}
-geopandas==1.0.1
-keplergl==0.3.7
-pydeck==0.9.1
-```
-
-### Initialize
-
-After you have installed the libraries and started the cluster, you can
initialize the Sedona `ST_*` functions and types by running from your code:
-
-(scala)
-
-```scala
-import org.apache.sedona.sql.utils.SedonaSQLRegistrator
-SedonaSQLRegistrator.registerAll(spark)
-```
-
-(or python)
-
-```python
-from sedona.register.geo_registrator import SedonaRegistrator
-
-SedonaRegistrator.registerAll(spark)
-```
-
-## Advanced editions
-
-In Databricks advanced editions, you need to install Sedona via [cluster
init-scripts](https://docs.databricks.com/clusters/init-scripts.html) as
described below. We recommend Databricks 10.x+. Sedona is not guaranteed to be
100% compatible with `Databricks photon acceleration`. Sedona requires Spark
internal APIs to inject many optimization strategies, which sometimes is not
accessible in `Photon`.
-
-In Spark 3.2, `org.apache.spark.sql.catalyst.expressions.Generator` class
added a field `nodePatterns`. Any SQL functions that rely on Generator class
may have issues if compiled for a runtime with a differing spark version. For
Sedona, those functions are:
-
-* ST_MakeValid
-* ST_SubDivideExplode
+In Databricks advanced editions, you need to install Sedona via [cluster init-scripts](https://docs.databricks.com/clusters/init-scripts.html) as described below. Sedona is not guaranteed to be 100% compatible with `Databricks Photon acceleration`. Sedona relies on Spark internal APIs to inject many optimization strategies, and these APIs are sometimes not accessible in `Photon`.
!!!note
- The following steps use DBR including Apache Spark 3.4.x as an example.
Please change the Spark version according to your DBR version.
+    The following steps use a DBR that includes Apache Spark 3.4.x as an example. Please change the Spark version according to your DBR version. Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](maven-coordinates.md). The compatibility between Databricks Runtime and Apache Spark versions can be found [here](https://docs.databricks.com/en/release-notes/runtime/index.html).
### Download Sedona jars
@@ -91,9 +41,6 @@ Of course, you can also do the steps above manually.
### Create an init script
-!!!warning
- Starting from December 2023, Databricks has disabled all DBFS based init
script (/dbfs/XXX/<script-name>.sh). So you will have to store the init script
from a workspace level (`/Workspace/Users/<user-name>/<script-name>.sh`) or
Unity Catalog volume
(`/Volumes/<catalog>/<schema>/<volume>/<path-to-script>/<script-name>.sh`).
Please see [Databricks init
scripts](https://docs.databricks.com/en/init-scripts/cluster-scoped.html#configure-a-cluster-scoped-init-script-using-the-ui)
for more [...]
-
!!!note
If you are creating a Shared cluster, you won't be able to use init
scripts and jars stored under `Workspace`. Please instead store them in
`Volumes`. The overall process should be the same.
diff --git a/docs/setup/docker.md b/docs/setup/docker.md
index 4af395020b..a5f708368e 100644
--- a/docs/setup/docker.md
+++ b/docs/setup/docker.md
@@ -68,7 +68,7 @@ This command will bind the container's ports 8888, 8080,
8081, 4040, 8085 to the
Example 2:
```bash
-docker run -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 -p 8085:8085
apache/sedona:{{ sedona.current_version }}
+docker run -d -p 8888:8888 -p 8080:8080 -p 8081:8081 -p 4040:4040 -p 8085:8085 apache/sedona:{{ sedona.current_version }}
```
This command will start a container with 4GB RAM for the driver and 4GB RAM for the executor, using the Sedona {{ sedona.current_version }} image.
@@ -91,7 +91,7 @@ docker run -d -e DRIVER_MEM=6g -e EXECUTOR_MEM=8g \
### Start coding
-Open your browser and go to [http://localhost:8888/](http://localhost:8888/)
to start coding with Sedona. You can also access Apache Zeppelin at
[http://localhost:8085/classic/](http://localhost:8085/classic/ ) using your
browser.
+Open your browser and go to [http://localhost:8888/](http://localhost:8888/) to start coding with Sedona in Jupyter Notebook. You can also access Apache Zeppelin at [http://localhost:8085/classic/](http://localhost:8085/classic/) in your browser.
### Notes
diff --git a/docs/setup/maven-coordinates.md b/docs/setup/maven-coordinates.md
index 90d5a2c419..e54870be7b 100644
--- a/docs/setup/maven-coordinates.md
+++ b/docs/setup/maven-coordinates.md
@@ -169,52 +169,6 @@ The optional GeoTools library is required if you want to
use CRS transformation,
</dependency>
```
-### netCDF-Java 5.4.2
-
-This is required only if you want to read HDF/NetCDF files using
`RS_FromNetCDF`. Note that this JAR is not in Maven Central so you will need to
add this repository to your pom.xml or build.sbt, or specify the URL in Spark
Config `spark.jars.repositories` or spark-submit `--repositories` option.
-
-!!!warning
- This jar was a required dependency due to a bug in Sedona 1.5.1. You
will need to specify the URL of the repository in `spark.jars.repositories` if
you use 1.5.1. This has been fixed in Sedona 1.5.2 and later.
-
-Under BSD 3-clause (compatible with Apache 2.0 license)
-
-!!! abstract "Add HDF/NetCDF dependency"
-
- === "Sedona 1.3.1+"
-
- Add unidata repo to your pom.xml
-
- ```
- <repositories>
- <repository>
- <id>unidata-all</id>
- <name>Unidata All</name>
-
<url>https://artifacts.unidata.ucar.edu/repository/unidata-all/</url>
- </repository>
- </repositories>
- ```
-
- Then add cdm-core to your POM dependency.
-
- ```xml
- <dependency>
- <groupId>edu.ucar</groupId>
- <artifactId>cdm-core</artifactId>
- <version>5.4.2</version>
- </dependency>
- ```
-
- === "Before Sedona 1.3.1"
-
- ```xml
- <!--
https://mvnrepository.com/artifact/org.datasyslab/sernetcdf -->
- <dependency>
- <groupId>org.datasyslab</groupId>
- <artifactId>sernetcdf</artifactId>
- <version>0.1.0</version>
- </dependency>
- ```
-
## Use Sedona unshaded jars
!!!warning
diff --git a/docs/setup/release-notes.md b/docs/setup/release-notes.md
index 7a5666846a..36c620beac 100644
--- a/docs/setup/release-notes.md
+++ b/docs/setup/release-notes.md
@@ -1164,7 +1164,7 @@ Sedona 1.4.0 is compiled against, Spark 3.3 / Flink 1.12,
Java 8.
* [X] **Sedona Spark & Flink** Serialize and deserialize geometries 3 - 7X
faster
* [X] **Sedona Spark & Flink** Google S2 based spatial join for fast
approximate point-in-polygon join. See [Join query in
Spark](../api/sql/Optimizer.md#google-s2-based-approximate-equi-join) and [Join
query in Flink](../tutorial/flink/sql.md#join-query)
-* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce
memory consumption by 10X: see
[explanation](../api/sql/Optimizer.md#Push-spatial-predicates-to-GeoParquet)
+* [X] **Sedona Spark** Pushdown spatial predicate on GeoParquet to reduce
memory consumption by 10X: see
[explanation](../api/sql/Optimizer.md#push-spatial-predicates-to-geoparquet)
* [X] **Sedona Spark** Automatically use broadcast index spatial join for
small datasets
* [X] **Sedona Spark** New RasterUDT added to Sedona GeoTiff reader.
* [X] **Sedona Spark** A number of bug fixes and improvement to the Sedona R
module.
diff --git a/docs/tutorial/files/geoparquet-sedona-spark.md
b/docs/tutorial/files/geoparquet-sedona-spark.md
index 28da219474..11d95d9c6d 100644
--- a/docs/tutorial/files/geoparquet-sedona-spark.md
+++ b/docs/tutorial/files/geoparquet-sedona-spark.md
@@ -76,7 +76,7 @@ df.show(truncate=False)
Here are the results:
```
-+---+---------------------+
++---+---------------------+
|id |geometry |
+---+---------------------+
|a |LINESTRING (2 5, 6 1)|
@@ -199,10 +199,10 @@ The value of `geoparquet.crs` and
`geoparquet.crs.<column_name>` can be one of t
* `""` (empty string): Omit the `crs` field. This implies that the CRS is
[OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) for CRS-aware
implementations.
* `"{...}"` (PROJJSON string): The `crs` field will be set as the PROJJSON
object representing the Coordinate Reference System (CRS) of the geometry. You
can find the PROJJSON string of a specific CRS from here: https://epsg.io/
(click the JSON option at the bottom of the page). You can also customize your
PROJJSON string as needed.
-Please note that Sedona currently cannot set/get a projjson string to/from a
CRS. Its geoparquet reader will ignore the projjson metadata and you will have
to set your CRS via [`ST_SetSRID`](../api/sql/Function.md#st_setsrid) after
reading the file.
+Please note that Sedona currently cannot set/get a projjson string to/from a
CRS. Its geoparquet reader will ignore the projjson metadata and you will have
to set your CRS via [`ST_SetSRID`](../../api/sql/Function.md#st_setsrid) after
reading the file.
Its geoparquet writer will not leverage the SRID field of a geometry so you
will have to always set the `geoparquet.crs` option manually when writing the
file, if you want to write a meaningful CRS field.
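For example, a write that sets the CRS metadata explicitly could look like the sketch below (the PROJJSON string and output path are placeholders):

```python
# Hedged sketch: write GeoParquet with an explicit CRS in the column metadata.
# Replace the placeholder with the PROJJSON of your CRS (e.g. copied from https://epsg.io/).
projjson_str = '{"type": "GeographicCRS", "...": "..."}'  # placeholder PROJJSON
df.write.format("geoparquet") \
    .option("geoparquet.crs", projjson_str) \
    .save("/path/to/output.parquet")
```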
-Due to the same reason, Sedona geoparquet reader and writer do NOT check the
axis order (lon/lat or lat/lon) and assume they are handled by the users
themselves when writing / reading the files. You can always use
[`ST_FlipCoordinates`](../api/sql/Function.md#st_flipcoordinates) to swap the
axis order of your geometries.
+Due to the same reason, Sedona geoparquet reader and writer do NOT check the
axis order (lon/lat or lat/lon) and assume they are handled by the users
themselves when writing / reading the files. You can always use
[`ST_FlipCoordinates`](../../api/sql/Function.md#st_flipcoordinates) to swap
the axis order of your geometries.
## Save GeoParquet with Covering Metadata
@@ -231,7 +231,7 @@
df_bbox.write.format("geoparquet").option("geoparquet.covering.geometry", "bbox"
## Sort then Save GeoParquet
-To maximize the performance of Sedona GeoParquet filter pushdown, we suggest
that you sort the data by their geohash values (see
[ST_GeoHash](../api/sql/Function.md#st_geohash)) and then save as a GeoParquet
file. An example is as follows:
+To maximize the performance of Sedona GeoParquet filter pushdown, we suggest
that you sort the data by their geohash values (see
[ST_GeoHash](../../api/sql/Function.md#st_geohash)) and then save as a
GeoParquet file. An example is as follows:
```
SELECT col1, col2, geom, ST_GeoHash(geom, 5) as geohash
@@ -253,7 +253,7 @@ Let’s look at an example of a dataset with points and three
bounding boxes.
Now, let’s apply a spatial filter to read points within a particular area:
-
+
Here is the query:
diff --git a/docs/tutorial/files/shapefiles-sedona-spark.md
b/docs/tutorial/files/shapefiles-sedona-spark.md
index 3b24349b68..a7df23c521 100644
--- a/docs/tutorial/files/shapefiles-sedona-spark.md
+++ b/docs/tutorial/files/shapefiles-sedona-spark.md
@@ -196,11 +196,11 @@ Due to these limitations, other options are worth
investigating.
There are a variety of other file formats that are good for geometric data:
* Iceberg
-* [GeoParquet](../geoparquet-sedona-spark)
+* [GeoParquet](geoparquet-sedona-spark.md)
* FlatGeoBuf
-* [GeoPackage](../geopackage-sedona-spark)
-* [GeoJSON](../geojson-sedona-spark)
-* [CSV](../csv-geometry-sedona-spark)
+* [GeoPackage](geopackage-sedona-spark.md)
+* [GeoJSON](geojson-sedona-spark.md)
+* [CSV](csv-geometry-sedona-spark.md)
* GeoTIFF
## Why Sedona does not support Shapefile writes
diff --git a/docs/api/sql/Stac.md b/docs/tutorial/files/stac-sedona-spark.md
similarity index 85%
rename from docs/api/sql/Stac.md
rename to docs/tutorial/files/stac-sedona-spark.md
index 8d56644e5e..062e6c5f55 100644
--- a/docs/api/sql/Stac.md
+++ b/docs/tutorial/files/stac-sedona-spark.md
@@ -17,6 +17,8 @@
under the License.
-->
+# STAC catalog with Apache Sedona and Spark
+
The STAC data source allows you to read data from a SpatioTemporal Asset
Catalog (STAC) API. The data source supports reading STAC items and collections.
## Usage
@@ -108,29 +110,29 @@ root
+------------+--------------------+-------+--------------------+--------------------+--------------------+-----+-----------+--------------------+--------------+------------+--------------------+--------------------+-----------+-----------+-------------+-------+----+--------------------+--------------------+--------------------+
```
-# Filter Pushdown
+## Filter Pushdown
The STAC data source supports predicate pushdown for spatial and temporal filters, applying them at the source level to reduce the amount of data that needs to be read.
-## Spatial Filter Pushdown
+### Spatial Filter Pushdown
Spatial filter pushdown allows the data source to apply spatial predicates
(e.g., st_contains, st_intersects) directly at the data source level, reducing
the amount of data transferred and processed.
-## Temporal Filter Pushdown
+### Temporal Filter Pushdown
Temporal filter pushdown allows the data source to apply temporal predicates
(e.g., BETWEEN, >=, <=) directly at the data source level, similarly reducing
the amount of data transferred and processed.
-# Examples
+## Examples
Here are some examples demonstrating how to query a STAC data source that is
loaded into a table named `STAC_TABLE`.
-## SQL Select Without Filters
+### SQL Select Without Filters
```sql
SELECT id, datetime as dt, geometry, bbox FROM STAC_TABLE
```
-## SQL Select With Temporal Filter
+### SQL Select With Temporal Filter
```sql
SELECT id, datetime as dt, geometry, bbox
@@ -140,7 +142,7 @@ SELECT id, datetime as dt, geometry, bbox FROM STAC_TABLE
In this example, the data source will push down the temporal filter to the
underlying data source.
-## SQL Select With Spatial Filter
+### SQL Select With Spatial Filter
```sql
SELECT id, geometry
@@ -150,7 +152,7 @@ In this example, the data source will push down the
temporal filter to the under
In this example, the data source will push down the spatial filter to the
underlying data source.
-## Sedona Configuration for STAC Reader
+### Sedona Configuration for STAC Reader
When using the STAC reader in Sedona, several configuration options can be set
to control the behavior of the reader. These configurations are typically set
in a `Map[String, String]` and passed to the reader. Below are the key sedona
configuration options:
@@ -192,73 +194,13 @@ These configurations can be combined into a single
`Map[String, String]` and pas
These options above provide fine-grained control over how the STAC data is
read and processed in Sedona.
-# Python API
+## Python API
The Python API allows you to interact with a SpatioTemporal Asset Catalog
(STAC) API using the Client class. This class provides methods to open a
connection to a STAC API, retrieve collections, and search for items with
various filters.
-## Client Class
-
-## Methods
-
-### `open(url: str) -> Client`
-
-Opens a connection to the specified STAC API URL.
-
-**Parameters:**
-
-- `url` (*str*): The URL of the STAC API to connect to.
- **Example:** `"https://planetarycomputer.microsoft.com/api/stac/v1"`
-
-**Returns:**
-
-- `Client`: An instance of the `Client` class connected to the specified URL.
-
----
-
-### `get_collection(collection_id: str) -> CollectionClient`
-
-Retrieves a collection client for the specified collection ID.
-
-**Parameters:**
-
-- `collection_id` (*str*): The ID of the collection to retrieve.
- **Example:** `"aster-l1t"`
-
-**Returns:**
-
-- `CollectionClient`: An instance of the `CollectionClient` class for the
specified collection.
-
----
-
-### `search(*ids: Union[str, list], collection_id: str, bbox: Optional[list] =
None, datetime: Optional[Union[str, datetime.datetime, list]] = None,
max_items: Optional[int] = None, return_dataframe: bool = True) ->
Union[Iterator[PyStacItem], DataFrame]`
-
-Searches for items in the specified collection with optional filters.
-
-**Parameters:**
-
-- `ids` (*Union[str, list]*): A variable number of item IDs to filter the
items.
- **Example:** `"item_id1"` or `["item_id1", "item_id2"]`
-- `collection_id` (*str*): The ID of the collection to search in.
- **Example:** `"aster-l1t"`
-- `bbox` (*Optional[list]*): A list of bounding boxes for filtering the items.
Each bounding box is represented as a list of four float values: `[min_lon,
min_lat, max_lon, max_lat]`.
- **Example:** `[[ -180.0, -90.0, 180.0, 90.0 ]]`
-- `datetime` (*Optional[Union[str, datetime.datetime, list]]*): A single
datetime, RFC 3339-compliant timestamp, or a list of date-time ranges for
filtering the items.
- **Example:**
- - `"2020-01-01T00:00:00Z"`
- - `datetime.datetime(2020, 1, 1)`
- - `[["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]`
-- `max_items` (*Optional[int]*): The maximum number of items to return from
the search, even if there are more matching results.
- **Example:** `100`
-- `return_dataframe` (*bool*): If `True` (default), return the result as a
Spark DataFrame instead of an iterator of `PyStacItem` objects.
- **Example:** `True`
-
-**Returns:**
+### Sample Code
-- *Union[Iterator[PyStacItem], DataFrame]*: An iterator of `PyStacItem`
objects or a Spark DataFrame that matches the specified filters.
-
-## Sample Code
-
-### Initialize the Client
+#### Initialize the Client
```python
from sedona.stac.client import Client
@@ -267,7 +209,7 @@ from sedona.stac.client import Client
client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
```
-### Search Items on a Collection Within a Year
+#### Search Items on a Collection Within a Year
```python
items = client.search(
@@ -275,7 +217,7 @@ items = client.search(
)
```
-### Search Items on a Collection Within a Month and Max Items
+#### Search Items on a Collection Within a Month and Max Items
```python
items = client.search(
@@ -283,7 +225,7 @@ items = client.search(
)
```
-### Search Items with Bounding Box and Interval
+#### Search Items with Bounding Box and Interval
```python
items = client.search(
@@ -295,14 +237,14 @@ items = client.search(
)
```
-### Search Multiple Items with Multiple Bounding Boxes
+#### Search Multiple Items with Multiple Bounding Boxes
```python
bbox_list = [[-180.0, -90.0, 180.0, 90.0], [-100.0, -50.0, 100.0, 50.0]]
items = client.search(collection_id="aster-l1t", bbox=bbox_list,
return_dataframe=False)
```
-### Search Items and Get DataFrame as Return with Multiple Intervals
+#### Search Items and Get DataFrame as Return with Multiple Intervals
```python
interval_list = [
@@ -315,7 +257,7 @@ df = client.search(
df.show()
```
-### Save Items in DataFrame to GeoParquet with Both Bounding Boxes and
Intervals
+#### Save Items in DataFrame to GeoParquet with Both Bounding Boxes and
Intervals
```python
# Save items in DataFrame to GeoParquet with both bounding boxes and intervals
@@ -326,7 +268,51 @@ client.get_collection("aster-l1t").save_to_geoparquet(
These examples demonstrate how to use the Client class to search for items in
a STAC collection with various filters and return the results as either an
iterator of PyStacItem objects or a Spark DataFrame.
-# References
+### Methods
+
+**`open(url: str) -> Client`**
+Opens a connection to the specified STAC API URL.
+
+Parameters:
+
+* `url` (*str*): The URL of the STAC API to connect to. Example:
`"https://planetarycomputer.microsoft.com/api/stac/v1"`
+
+Returns:
+
+* `Client`: An instance of the `Client` class connected to the specified URL.
+
+---
+
+**`get_collection(collection_id: str) -> CollectionClient`**
+Retrieves a collection client for the specified collection ID.
+
+Parameters:
+
+* `collection_id` (*str*): The ID of the collection to retrieve. Example:
`"aster-l1t"`
+
+Returns:
+
+* `CollectionClient`: An instance of the `CollectionClient` class for the
specified collection.
+
+---
+
+**`search(*ids: Union[str, list], collection_id: str, bbox: Optional[list] =
None, datetime: Optional[Union[str, datetime.datetime, list]] = None,
max_items: Optional[int] = None, return_dataframe: bool = True) ->
Union[Iterator[PyStacItem], DataFrame]`**
+Searches for items in the specified collection with optional filters.
+
+Parameters:
+
+* `ids` (*Union[str, list]*): A variable number of item IDs to filter the
items. Example: `"item_id1"` or `["item_id1", "item_id2"]`
+* `collection_id` (*str*): The ID of the collection to search in. Example:
`"aster-l1t"`
+* `bbox` (*Optional[list]*): A list of bounding boxes for filtering the items,
represented as `[min_lon, min_lat, max_lon, max_lat]`. Example: `[[ -180.0,
-90.0, 180.0, 90.0 ]]`
+* `datetime` (*Optional[Union[str, datetime.datetime, list]]*): A single
datetime, RFC 3339-compliant timestamp, or a list of date-time ranges. Example:
`"2020-01-01T00:00:00Z"`, `datetime.datetime(2020, 1, 1)`,
`[["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]`
+* `max_items` (*Optional[int]*): The maximum number of items to return.
Example: `100`
+* `return_dataframe` (*bool*): If `True` (default), return the result as a
Spark DataFrame instead of an iterator of `PyStacItem` objects. Example: `True`
+
+Returns:
+
+* *Union[Iterator[PyStacItem], DataFrame]*: An iterator of `PyStacItem`
objects or a Spark DataFrame that matches the specified filters.
+
+## References
- STAC Specification: https://stacspec.org/
diff --git a/docs/tutorial/python-vector-osm.md
b/docs/tutorial/python-vector-osm.md
deleted file mode 100644
index 00f19f4322..0000000000
--- a/docs/tutorial/python-vector-osm.md
+++ /dev/null
@@ -1,159 +0,0 @@
-<!--
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements. See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership. The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied. See the License for the
- specific language governing permissions and limitations
- under the License.
- -->
-
-# Example of spark + sedona + hdfs with slave nodes and OSM vector data
consults
-
-```
-from IPython.display import display, HTML
-from pyspark.sql import SparkSession
-from pyspark import StorageLevel
-import pandas as pd
-from pyspark.sql.types import StructType, StructField,StringType, LongType,
IntegerType, DoubleType, ArrayType
-from pyspark.sql.functions import regexp_replace
-from sedona.register import SedonaRegistrator
-from sedona.utils import SedonaKryoRegistrator, KryoSerializer
-from pyspark.sql.functions import col, split, expr
-from pyspark.sql.functions import udf, lit
-from sedona.utils import SedonaKryoRegistrator, KryoSerializer
-from pyspark.sql.functions import col, split, expr
-from pyspark.sql.functions import udf, lit, flatten
-from pywebhdfs.webhdfs import PyWebHdfsClient
-from datetime import date
-from pyspark.sql.functions import monotonically_increasing_id
-import json
-```
-
-## Registering spark session, adding node executor configurations and sedona
registrator
-
-```
-spark = SparkSession.\
- builder.\
- appName("Overpass-API").\
- enableHiveSupport().\
- master("local[*]").\
- master("spark://spark-master:7077").\
- config("spark.executor.memory", "15G").\
- config("spark.driver.maxResultSize", "135G").\
- config("spark.sql.shuffle.partitions", "500").\
- config(' spark.sql.adaptive.coalescePartitions.enabled', True).\
- config('spark.sql.adaptive.enabled', True).\
- config('spark.sql.adaptive.coalescePartitions.initialPartitionNum', 125).\
- config("spark.sql.execution.arrow.pyspark.enabled", True).\
- config("spark.sql.execution.arrow.fallback.enabled", True).\
- config('spark.kryoserializer.buffer.max', 2047).\
- config("spark.serializer", KryoSerializer.getName).\
- config("spark.kryo.registrator", SedonaKryoRegistrator.getName).\
- config("spark.jars.packages",
"org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.4.0,org.datasyslab:geotools-wrapper:1.4.0-28.2")
.\
- enableHiveSupport().\
- getOrCreate()
-
-SedonaRegistrator.registerAll(spark)
-sc = spark.sparkContext
-```
-
-## Connecting to Overpass API to search and downloading data for saving into
HDFS
-
-```
-import requests
-import json
-
-overpass_url = "http://overpass-api.de/api/interpreter"
-overpass_query = """
-[out:json];
-area[name = "Foz do Iguaçu"];
-way(area)["highway"~""];
-out geom;
->;
-out skel qt;
-"""
-
-response = requests.get(overpass_url,
- params={'data': overpass_query})
-data = response.json()
-hdfs = PyWebHdfsClient(host='179.106.229.159',port='50070', user_name='root')
-file_name = "foz_roads_osm.json"
-hdfs.delete_file_dir(file_name)
-hdfs.create_file(file_name, json.dumps(data))
-
-```
-
-## Connecting spark sedona with saved hdfs file
-
-```
-path = "hdfs://776faf4d6a1e:8020/"+file_name
-df = spark.read.json(path, multiLine = "true")
-```
-
-## Consulting and organizing data for analysis
-
-```
-from pyspark.sql.functions import explode, arrays_zip
-
-df.createOrReplaceTempView("df")
-tb = spark.sql("select *, size(elements) total_nodes from df")
-tb.show(5)
-
-isolate_total_nodes = tb.select("total_nodes").toPandas()
-total_nodes = isolate_total_nodes["total_nodes"].iloc[0]
-print(total_nodes)
-
-isolate_ids = tb.select("elements.id").toPandas()
-ids = pd.DataFrame(isolate_ids["id"].iloc[0]).drop_duplicates()
-print(ids[0].iloc[1])
-
-formatted_df = tb\
-.withColumn("id", explode("elements.id"))
-
-formatted_df.show(5)
-
-formatted_df = tb\
-.withColumn("new", arrays_zip("elements.id", "elements.geometry",
"elements.nodes", "elements.tags"))\
-.withColumn("new", explode("new"))
-
-formatted_df.show(5)
-
-# formatted_df.printSchema()
-
-formatted_df =
formatted_df.select("new.0","new.1","new.2","new.3.maxspeed","new.3.incline","new.3.surface",
"new.3.name", "total_nodes")
-formatted_df =
formatted_df.withColumnRenamed("0","id").withColumnRenamed("1","geom").withColumnRenamed("2","nodes").withColumnRenamed("3","tags")
-formatted_df.createOrReplaceTempView("formatted_df")
-formatted_df.show(5)
-# TODO atualizar daqui para baixo para considerar a linha inteira na lógica
-points_tb = spark.sql("select geom, id from formatted_df where geom IS NOT
NULL")
-points_tb = points_tb\
-.withColumn("new", arrays_zip("geom.lat", "geom.lon"))\
-.withColumn("new", explode("new"))
-
-points_tb = points_tb.select("new.0","new.1", "id")
-
-points_tb = points_tb.withColumnRenamed("0","lat").withColumnRenamed("1","lon")
-points_tb.printSchema()
-
-points_tb.createOrReplaceTempView("points_tb")
-
-points_tb.show(5)
-
-coordinates_tb = spark.sql("select (select
collect_list(CONCAT(p1.lat,',',p1.lon)) from points_tb p1 where p1.id = p2.id
group by p1.id) as coordinates, p2.id, p2.maxspeed, p2.incline, p2.surface,
p2.name, p2.nodes, p2.total_nodes from formatted_df p2")
-coordinates_tb.createOrReplaceTempView("coordinates_tb")
-coordinates_tb.show(5)
-
-roads_tb = spark.sql("SELECT
ST_LineStringFromText(REPLACE(REPLACE(CAST(coordinates as
string),'[',''),']',''), ',') as geom, id, maxspeed, incline, surface, name,
nodes, total_nodes FROM coordinates_tb WHERE coordinates IS NOT NULL")
-roads_tb.createOrReplaceTempView("roads_tb")
-roads_tb.show(5)
-```
diff --git a/docs/tutorial/raster.md b/docs/tutorial/raster.md
index 641b441428..65541ced6c 100644
--- a/docs/tutorial/raster.md
+++ b/docs/tutorial/raster.md
@@ -21,7 +21,7 @@
Sedona uses 1-based indexing for all raster functions except [map algebra
function](../api/sql/Raster-map-algebra.md), which uses 0-based indexing.
!!!note
- Since v`1.5.0`, Sedona assumes geographic coordinates to be in
longitude/latitude order. If your data is lat/lon order, please use
`ST_FlipCoordinates` to swap X and Y.
+    Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is in lat/lon order, please use `ST_FlipCoordinates` to swap X and Y.
Starting from `v1.1.0`, Sedona SQL supports raster data sources and raster
operators in DataFrame and SQL. Raster support is available in all Sedona
language bindings including ==Scala, Java, Python, and R==.
@@ -67,8 +67,6 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL
API](../api/sql/Overview.
Use the following code to create your Sedona config at the beginning. If you
already have a SparkSession (usually named `spark`) created by Wherobots/AWS
EMR/Databricks, please skip this step and use `spark` directly.
-==Sedona >= 1.4.1==
-
You can add additional Spark runtime config to the config builder. For
example,
`SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold",
"10485760")`
=== "Scala"
@@ -114,65 +112,10 @@ You can add additional Spark runtime config to the config
builder. For example,
```
Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-==Sedona < 1.4.1==
-
-The following method has been deprecated since Sedona 1.4.1. Please use the
method above to create your Sedona config.
-
-=== "Scala"
-
- ```scala
- var sparkSession = SparkSession.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- // Enable Sedona custom Kryo serializer
- .config("spark.serializer", classOf[KryoSerializer].getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
classOf[SedonaKryoRegistrator].getName)
- .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator
- ```
- If you use SedonaViz together with SedonaSQL, please use the following
two lines to enable Sedona Kryo serializer instead:
- ```scala
- .config("spark.serializer", classOf[KryoSerializer].getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
classOf[SedonaVizKryoRegistrator].getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
- ```
-
-=== "Java"
-
- ```java
- SparkSession sparkSession = SparkSession.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- // Enable Sedona custom Kryo serializer
- .config("spark.serializer", KryoSerializer.class.getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator", SedonaKryoRegistrator.class.getName)
- .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator
- ```
- If you use SedonaViz together with SedonaSQL, please use the following
two lines to enable Sedona Kryo serializer instead:
- ```scala
- .config("spark.serializer", KryoSerializer.class.getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
- ```
-
-=== "Python"
-
- ```python
- sparkSession = SparkSession. \
- builder. \
- appName('appName'). \
- config("spark.serializer", KryoSerializer.getName). \
- config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
- ```
- Please replace the `3.3` in the package name of sedona-spark-shaded with
the corresponding major.minor version of Spark, such as
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
## Initiate SedonaContext
Add the following line after creating the Sedona config. If you already have a
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks,
please call `SedonaContext.create(spark)` instead.
-==Sedona >= 1.4.1==
-
=== "Scala"
```scala
@@ -197,30 +140,6 @@ Add the following line after creating the Sedona config.
If you already have a S
sedona = SedonaContext.create(config)
```
-==Sedona < 1.4.1==
-
-The following method has been deprecated since Sedona 1.4.1. Please use the
method above to create your SedonaContext.
-
-=== "Scala"
-
- ```scala
- SedonaSQLRegistrator.registerAll(sparkSession)
- ```
-
-=== "Java"
-
- ```java
- SedonaSQLRegistrator.registerAll(sparkSession)
- ```
-
-=== "Python"
-
- ```python
- from sedona.register import SedonaRegistrator
-
- SedonaRegistrator.registerAll(spark)
- ```
-
You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
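A hedged sketch of the equivalent configuration from PySpark follows (the app name and package coordinates are illustrative and must match your Spark/Scala versions as described above):

```python
# Hedged sketch: register Sedona via the SQL extensions config instead of SedonaContext.
# The package coordinates below are illustrative; align them with your Spark/Scala versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sedona-extensions-example")
    .config("spark.sql.extensions", "org.apache.sedona.sql.SedonaSqlExtensions")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }},"
        "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
    )
    .getOrCreate()
)
```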
## Load data from files
diff --git a/docs/tutorial/rdd.md b/docs/tutorial/rdd.md
index fd5a1fcfde..d3b27896f7 100644
--- a/docs/tutorial/rdd.md
+++ b/docs/tutorial/rdd.md
@@ -765,9 +765,9 @@ Distance join can only accept `COVERED_BY` and `INTERSECTS`
as spatial predicate
The details of spatial partitioning in join query is
[here](#use-spatial-partitioning).
-The details of using spatial indexes in join query is
[here](#use-spatial-indexes-2).
+The details of using spatial indexes in join query is
[here](#use-spatial-indexes_2).
-The output format of the distance join query is [here](#output-format-2).
+The output format of the distance join query is [here](#output-format_2).
!!!note
Distance join query is equal to the following query in Spatial SQL:
diff --git a/docs/tutorial/snowflake/sql.md b/docs/tutorial/snowflake/sql.md
index 02ef7c7e01..ba42f23138 100644
--- a/docs/tutorial/snowflake/sql.md
+++ b/docs/tutorial/snowflake/sql.md
@@ -302,7 +302,7 @@ Please use the following steps:
### 1. Generate S2 ids for both tables
-Use [ST_S2CellIds](../../api/snowflake/vector-data/Function.md#ST_S2CellIDs)
to generate cell IDs. Each geometry may produce one or more IDs.
+Use [ST_S2CellIds](../../api/snowflake/vector-data/Function.md#st_s2cellids)
to generate cell IDs. Each geometry may produce one or more IDs.
```sql
SELECT * FROM lefts, TABLE(FLATTEN(ST_S2CellIDs(lefts.geom, 15))) s1
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
index 4ea1ff0754..bd5327f675 100644
--- a/docs/tutorial/sql.md
+++ b/docs/tutorial/sql.md
@@ -20,7 +20,7 @@
This page outlines the steps to manage spatial data using SedonaSQL.
!!!note
- Since v`1.5.0`, Sedona assumes geographic coordinates to be in
longitude/latitude order. If your data is lat/lon order, please use
`ST_FlipCoordinates` to swap X and Y.
+    Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is in lat/lon order, please use `ST_FlipCoordinates` to swap X and Y.
SedonaSQL supports SQL/MM Part3 Spatial SQL Standard. It includes four kinds
of SQL operators as follows. All these operators can be directly called through:
@@ -64,8 +64,6 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL
API](../api/sql/Overview.
Use the following code to create your Sedona config at the beginning. If you
already have a SparkSession (usually named `spark`) created by AWS
EMR/Databricks/Microsoft Fabric, please ==skip this step==.
-==Sedona >= 1.4.1==
-
You can add additional Spark runtime config to the config builder. For
example,
`SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold",
"10485760")`
=== "Scala"
@@ -111,65 +109,10 @@ You can add additional Spark runtime config to the config
builder. For example,
```
If you are using a different Spark version, please replace the `3.3` in
package name of sedona-spark-shaded with the corresponding major.minor version
of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-==Sedona < 1.4.1==
-
-The following method has been deprecated since Sedona 1.4.1. Please use the
method above to create your Sedona config.
-
-=== "Scala"
-
- ```scala
- var sparkSession = SparkSession.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- // Enable Sedona custom Kryo serializer
- .config("spark.serializer", classOf[KryoSerializer].getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
classOf[SedonaKryoRegistrator].getName)
- .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator
- ```
- If you use SedonaViz together with SedonaSQL, please use the following
two lines to enable Sedona Kryo serializer instead:
- ```scala
- .config("spark.serializer", classOf[KryoSerializer].getName) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
classOf[SedonaVizKryoRegistrator].getName) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
- ```
-
-=== "Java"
-
- ```java
- SparkSession sparkSession = SparkSession.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestJava") // Change this to a proper name
- // Enable Sedona custom Kryo serializer
- .config("spark.serializer", KryoSerializer.class.getName()) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator", SedonaKryoRegistrator.class.getName())
- .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator
- ```
- If you use SedonaViz together with SedonaSQL, please use the following
two lines to enable Sedona Kryo serializer instead:
- ```java
- .config("spark.serializer", KryoSerializer.class.getName()) //
org.apache.spark.serializer.KryoSerializer
- .config("spark.kryo.registrator",
SedonaVizKryoRegistrator.class.getName()) //
org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
- ```
-
-=== "Python"
-
- ```python
- sparkSession = SparkSession. \
- builder. \
- appName('readTestPython'). \
- config("spark.serializer", KryoSerializer.getName()). \
- config("spark.kryo.registrator", SedonaKryoRegistrator.getName()). \
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{
sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools
}}'). \
- getOrCreate()
- ```
- If you are using Spark versions >= 3.4, please replace the `3.0` in
package name of sedona-spark-shaded with the corresponding major.minor version
of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
## Initiate SedonaContext
Add the following line after creating the Sedona config. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please call `sedona = SedonaContext.create(spark)` instead. For ==Databricks==, the situation is more complicated; please refer to the [Databricks setup guide](../setup/databricks.md), but generally you don't need to create a SedonaContext.
-==Sedona >= 1.4.1==
-
=== "Scala"
```scala
@@ -194,33 +137,9 @@ Add the following line after creating Sedona config. If
you already have a Spark
sedona = SedonaContext.create(config)
```
-==Sedona < 1.4.1==
-
-The following method has been deprecated since Sedona 1.4.1. Please use the
method above to create your SedonaContext.
-
-=== "Scala"
-
- ```scala
- SedonaSQLRegistrator.registerAll(sparkSession)
- ```
-
-=== "Java"
-
- ```java
- SedonaSQLRegistrator.registerAll(sparkSession)
- ```
-
-=== "Python"
-
- ```python
- from sedona.register import SedonaRegistrator
-
- SedonaRegistrator.registerAll(spark)
- ```
-
You can also register everything by passing `--conf
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to
`spark-submit` or `spark-shell`.
-## Load data from files
+## Load data from text files
Assume we have a WKT file, namely `usa-county.tsv`, at Path
`/Download/usa-county.tsv` as follows:
@@ -233,6 +152,8 @@ POLYGON (..., ...) Lancaster County
The file may have many other columns.
+### Load the raw DataFrame
+
Use the following code to load the data and create a raw DataFrame:
=== "Scala"
@@ -267,7 +188,7 @@ The output will be like this:
|POLYGON ((-96.910...| 31|109|00835876|31109| Lancaster| Lancaster County|
06| H1|G4020| 339|30700|null| A|2169240202|22877180|+40.7835474|-096.6886584|
```
-## Create a Geometry type column
+### Create a Geometry type column
All geometrical operations in SedonaSQL are on Geometry type objects. Therefore, before running any queries, you need to create a Geometry type column on a DataFrame.
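As a hedged sketch (column names assume the TSV above was loaded without headers, so they are `_c0`, `_c1`, ...):

```python
# Hedged sketch: build a Geometry column from the WKT strings loaded above.
# Assumes the WKT sits in `_c0` and a county name in a later column such as `_c6`.
raw_df.createOrReplaceTempView("rawdf")
county_df = sedona.sql("""
    SELECT ST_GeomFromWKT(_c0) AS countyshape, _c6 AS name
    FROM rawdf
""")
county_df.show(5)
```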
@@ -315,56 +236,6 @@ root
Since `v1.6.1`, Sedona supports reading GeoJSON files using the `geojson` data
source. It is designed to handle JSON files that use [GeoJSON
format](https://datatracker.ietf.org/doc/html/rfc7946) for their geometries.
-This includes SpatioTemporal Asset Catalog (STAC) files, GeoJSON features,
GeoJSON feature collections and other variations.
-The key functionality lies in the way 'geometry' fields are processed: these
are specifically read as Sedona's `GeometryUDT` type, ensuring integration with
Sedona's suite of spatial functions.
-
-### Key features
-
-- Broad Support: The reader and writer are versatile, supporting all
GeoJSON-formatted files, including STAC files, feature collections, and more.
-- Geometry Transformation: When reading, fields named 'geometry' are
automatically converted from GeoJSON format to Sedona's `GeometryUDT` type and
vice versa when writing.
-
-### Load MultiLine GeoJSON FeatureCollection
-
-Suppose we have a GeoJSON FeatureCollection file as follows.
-This entire file is considered as a single GeoJSON FeatureCollection object.
-Multiline format is preferable for scenarios where files need to be
human-readable or manually edited.
-
-```json
-{ "type": "FeatureCollection",
- "features": [
- { "type": "Feature",
- "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
- "properties": {"prop0": "value0"}
- },
- { "type": "Feature",
- "geometry": {
- "type": "LineString",
- "coordinates": [
- [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
- ]
- },
- "properties": {
- "prop0": "value1",
- "prop1": 0.0
- }
- },
- { "type": "Feature",
- "geometry": {
- "type": "Polygon",
- "coordinates": [
- [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
- [100.0, 1.0], [100.0, 0.0] ]
- ]
- },
- "properties": {
- "prop0": "value2",
- "prop1": {"this": "that"}
- }
- }
- ]
-}
-```
-
Set the `multiLine` option to `True` to read multiline GeoJSON files.
=== "Python"
@@ -402,81 +273,7 @@ Set the `multiLine` option to `True` to read multiline
GeoJSON files.
df.printSchema();
```
-The output is as follows:
-
-```
-+--------------------+------+
-| geometry| prop0|
-+--------------------+------+
-| POINT (102 0.5)|value0|
-|LINESTRING (102 0...|value1|
-|POLYGON ((100 0, ...|value2|
-+--------------------+------+
-
-root
- |-- geometry: geometry (nullable = false)
- |-- prop0: string (nullable = true)
-
-```
-
-### Load Single Line GeoJSON Features
-
-Suppose we have a single-line GeoJSON Features dataset as follows. Each line
is a single GeoJSON Feature.
-This format is efficient for processing large datasets where each line is a
separate, self-contained GeoJSON object.
-
-```json
-{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
-{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
-{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
-```
-
-By default, when `option` is not specified, Sedona reads a GeoJSON file as a
single line GeoJSON.
-
-=== "Python"
-
- ```python
- df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
- .withColumn("prop0",
f.expr("properties['prop0']")).drop("properties").drop("type")
-
- df.show()
- df.printSchema()
- ```
-
-=== "Scala"
-
- ```scala
- val df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
- .withColumn("prop0",
expr("properties['prop0']")).drop("properties").drop("type")
-
- df.show()
- df.printSchema()
- ```
-
-=== "Java"
-
- ```java
- Dataset<Row> df =
sedona.read.format("geojson").load("PATH/TO/MYFILE.json")
- .withColumn("prop0",
expr("properties['prop0']")).drop("properties").drop("type")
-
- df.show()
- df.printSchema()
- ```
-
-The output is as follows:
-
-```
-+--------------------+------+
-| geometry| prop0|
-+--------------------+------+
-| POINT (102 0.5)|value0|
-|LINESTRING (102 0...|value1|
-|POLYGON ((100 0, ...|value2|
-+--------------------+------+
-
-root
- |-- geometry: geometry (nullable = false)
- |-- prop0: string (nullable = true)
-```
+See [this page](files/geojson-sedona-spark.md) for more information on loading
GeoJSON files.
## Load Shapefile
@@ -502,7 +299,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a
DataFrame.
The input path can be a directory containing one or multiple shapefiles, or
path to a `.shp` file.
-See [this page](../files/shapefile-sedona-spark) for more information on
loading Shapefiles.
+See [this page](files/shapefiles-sedona-spark.md) for more information on
loading Shapefiles.
## Load GeoParquet
@@ -550,7 +347,31 @@ Please refer to [Reading Legacy Parquet
Files](../api/sql/Reading-legacy-parquet
GeoParquet file reader does not work on Databricks runtime when Photon
is enabled. Please disable Photon when using
GeoParquet file reader on Databricks runtime.
-See [this page](../files/geoparquet-sedona-spark) for more information on
loading GeoParquet.
+See [this page](files/geoparquet-sedona-spark.md) for more information on
loading GeoParquet.
+
+## Load data from STAC catalog
+
+The Sedona STAC data source allows you to read data from a SpatioTemporal Asset Catalog (STAC) API. The data source supports reading STAC items and collections.
+
+You can load a STAC collection from an S3 collection file object:
+
+```python
+df = sedona.read.format("stac").load(
+ "s3a://example.com/stac_bucket/stac_collection.json"
+)
+```
+
+You can also load a STAC collection from an HTTP/HTTPS endpoint:
+
+```python
+df = sedona.read.format("stac").load(
+
"https://earth-search.aws.element84.com/v1/collections/sentinel-2-pre-c1-l2a"
+)
+```
+
+The STAC data source supports predicate pushdown for spatial and temporal filters, applying them at the source level to reduce the amount of data that needs to be read.
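For example, a query of the hedged form below would let both predicates be pushed down (the polygon, dates, and view name are placeholders; `geometry` and `datetime` follow the STAC item schema):

```python
# Hedged sketch: spatial and temporal predicates that the STAC reader can push down.
# The polygon, date range, and view name are placeholders.
df.createOrReplaceTempView("stac_items")
filtered = sedona.sql("""
    SELECT id, datetime, geometry
    FROM stac_items
    WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON ((-123 37, -123 39, -121 39, -121 37, -123 37))'))
      AND datetime BETWEEN '2020-01-01' AND '2020-12-31'
""")
filtered.show()
```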
+
+See [this page](files/stac-sedona-spark.md) for more information on loading
data from STAC.
## Load data from JDBC data sources
@@ -612,7 +433,7 @@ For Postgis there is no need to add a query to convert
geometry types since it's
.withColumn("geom", f.expr("ST_GeomFromWKB(geom)")))
```
-## Load from GeoPackage
+## Load GeoPackage
Since v1.7.0, Sedona supports loading Geopackage file format as a DataFrame.
@@ -634,9 +455,9 @@ Since v1.7.0, Sedona supports loading Geopackage file
format as a DataFrame.
df = sedona.read.format("geopackage").option("tableName",
"tab").load("/path/to/geopackage")
```
-See [this page](../files/geopackage-sedona-spark) for more information on
loading GeoPackage.
+See [this page](files/geopackage-sedona-spark.md) for more information on
loading GeoPackage.
-## Load from OSM PBF
+## Load OSM PBF
Since v1.7.1, Sedona supports loading OSM PBF file format as a DataFrame.
@@ -732,14 +553,6 @@ and for relation
+-----+--------+--------+--------------------+--------------------+--------------------+--------------------+
```
-Known limitations (v1.7.0):
-
-- webp rasters are not supported
-- ewkb geometries are not supported
-- filtering based on geometries envelopes are not supported
-
-All points above should be resolved soon, stay tuned !
-
## Transform the Coordinate Reference System
Sedona doesn't control the coordinate unit (degree-based or meter-based) of all geometries in a Geometry column. The unit of all related distances in SedonaSQL is the same as the unit of all geometries in a Geometry column.
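As a hedged sketch (view and column names are illustrative), reprojecting degree-based WGS84 geometries to a meter-based CRS could look like:

```python
# Hedged sketch: reproject geometries from EPSG:4326 (degrees) to EPSG:3857 (meters)
# so that distance-related functions operate in meters. Names are illustrative.
county_df.createOrReplaceTempView("countydf")
projected_df = sedona.sql("""
    SELECT ST_Transform(countyshape, 'EPSG:4326', 'EPSG:3857') AS countyshape, name
    FROM countydf
""")
projected_df.show(5)
```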
@@ -828,7 +641,7 @@ The output will look like this:
+----------------+---+------+-------+
```
-See [this page](../concepts/clustering-algorithms) for more information on the
DBSCAN algorithm.
+See [this page](concepts/clustering-algorithms.md) for more information on the
DBSCAN algorithm.
## Calculate the Local Outlier Factor (LOF)
@@ -1393,7 +1206,7 @@ SELECT ST_AsText(countyshape)
FROM polygondf
```
-## Save as GeoJSON
+## Save GeoJSON
Since `v1.6.1`, the GeoJSON data source in Sedona can be used to save a
Spatial DataFrame to a single-line JSON file, with geometries written in
GeoJSON format.
@@ -1401,13 +1214,7 @@ Since `v1.6.1`, the GeoJSON data source in Sedona can be
used to save a Spatial
df.write.format("geojson").save("YOUR/PATH.json")
```
-The structure of the generated file will be like this:
-
-```json
-{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}}
-{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}}
-{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}}
-```
+See [this page](files/geojson-sedona-spark.md) for more information on writing
to GeoJSON.
## Save GeoParquet
@@ -1417,7 +1224,7 @@ Since v`1.3.0`, Sedona natively supports writing
GeoParquet file. GeoParquet can
df.write.format("geoparquet").save(geoparquetoutputlocation +
"/GeoParquet_File_Name.parquet")
```
-See [this page](../files/geoparquet-sedona-spark) for more information on
writing to GeoParquet.
+See [this page](files/geoparquet-sedona-spark.md) for more information on
writing to GeoParquet.
## Save to Postgis
diff --git a/mkdocs.yml b/mkdocs.yml
index eb478fc515..6f1cf1c5e1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -66,6 +66,10 @@ nav:
- GeoParquet: tutorial/files/geoparquet-sedona-spark.md
- GeoJSON: tutorial/files/geojson-sedona-spark.md
- Shapefiles: tutorial/files/shapefiles-sedona-spark.md
+ - STAC catalog: tutorial/files/stac-sedona-spark.md
+ - Concepts:
+ - Spatial Joins: tutorial/concepts/spatial-joins.md
+ - Clustering Algorithms:
tutorial/concepts/clustering-algorithms.md
- Map visualization SQL app:
- Scala/Java: tutorial/viz.md
- Use Apache Zeppelin: tutorial/zeppelin.md
@@ -81,9 +85,6 @@ nav:
- Examples:
- Scala/Java: tutorial/demo.md
- Python: tutorial/jupyter-notebook.md
- - Concepts:
- - Spatial Joins: tutorial/concepts/spatial-joins.md
- - Clustering Algorithms: tutorial/concepts/clustering-algorithms.md
- API Docs:
- Sedona with Apache Spark:
- SQL:
@@ -97,7 +98,6 @@ nav:
- Query optimization: api/sql/Optimizer.md
- Nearest-Neighbour searching:
api/sql/NearestNeighbourSearching.md
- "Spider:Spatial Data Generator": api/sql/Spider.md
- - Reading STAC Data Source: api/sql/Stac.md
- Reading Legacy Parquet Files:
api/sql/Reading-legacy-parquet.md
- Visualization:
- SedonaPyDeck: api/sql/Visualization_SedonaPyDeck.md
@@ -145,11 +145,6 @@ nav:
- Make a release: community/publish.md
- Vote a release: community/vote.md
- Publications: community/publication.md
- - Use cases:
- - Spatially aggregate airports per country:
usecases/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb
- - Match foot traffic to Seattle coffee shops:
usecases/contrib/foot-traffic.ipynb
- - Raster image manipulation: usecases/ApacheSedonaRaster.ipynb
- - Read Overture Maps data: usecases/Sedona_OvertureMaps_GeoParquet.ipynb
- Apache Software Foundation:
- Foundation: asf/asf.md
- License: https://www.apache.org/licenses/" target="_blank
@@ -239,7 +234,5 @@ plugins:
- macros
- git-revision-date-localized:
type: datetime
- - mkdocs-jupyter:
- include_source: True
- mike:
canonical_version: "latest"