This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new 692b0088e3 [DOCS] update databricks setup instructions (#1963)
692b0088e3 is described below
commit 692b0088e3854f551b2f27c75945d628a28517da
Author: Matthew Powers <[email protected]>
AuthorDate: Tue Jun 3 01:54:09 2025 -0400
[DOCS] update databricks setup instructions (#1963)
* [DOCS] update databricks setup instructions
* update based on pr feedback
* run lint
* address pr comments
* remove reference to deleted section
---
docs/image/databricks/image1.png | Bin 0 -> 25688 bytes
docs/image/databricks/image2.png | Bin 0 -> 31551 bytes
docs/image/databricks/image3.png | Bin 0 -> 41030 bytes
docs/image/databricks/image4.png | Bin 0 -> 53196 bytes
docs/image/databricks/image5.png | Bin 0 -> 56423 bytes
docs/image/databricks/image6.png | Bin 0 -> 28553 bytes
docs/image/databricks/image7.png | Bin 0 -> 50136 bytes
docs/image/databricks/image8.png | Bin 0 -> 26013 bytes
docs/image/databricks/image9.png | Bin 0 -> 31958 bytes
docs/setup/databricks.md | 114 +++++++++++++++++++++++++++++----------
10 files changed, 86 insertions(+), 28 deletions(-)
diff --git a/docs/image/databricks/image1.png b/docs/image/databricks/image1.png
new file mode 100644
index 0000000000..aa855ff759
Binary files /dev/null and b/docs/image/databricks/image1.png differ
diff --git a/docs/image/databricks/image2.png b/docs/image/databricks/image2.png
new file mode 100644
index 0000000000..d1d8fb3a6f
Binary files /dev/null and b/docs/image/databricks/image2.png differ
diff --git a/docs/image/databricks/image3.png b/docs/image/databricks/image3.png
new file mode 100644
index 0000000000..f891bb6042
Binary files /dev/null and b/docs/image/databricks/image3.png differ
diff --git a/docs/image/databricks/image4.png b/docs/image/databricks/image4.png
new file mode 100644
index 0000000000..0d004d63ca
Binary files /dev/null and b/docs/image/databricks/image4.png differ
diff --git a/docs/image/databricks/image5.png b/docs/image/databricks/image5.png
new file mode 100644
index 0000000000..a29d0bc311
Binary files /dev/null and b/docs/image/databricks/image5.png differ
diff --git a/docs/image/databricks/image6.png b/docs/image/databricks/image6.png
new file mode 100644
index 0000000000..91db5b541a
Binary files /dev/null and b/docs/image/databricks/image6.png differ
diff --git a/docs/image/databricks/image7.png b/docs/image/databricks/image7.png
new file mode 100644
index 0000000000..0106017fba
Binary files /dev/null and b/docs/image/databricks/image7.png differ
diff --git a/docs/image/databricks/image8.png b/docs/image/databricks/image8.png
new file mode 100644
index 0000000000..163f7bffec
Binary files /dev/null and b/docs/image/databricks/image8.png differ
diff --git a/docs/image/databricks/image9.png b/docs/image/databricks/image9.png
new file mode 100644
index 0000000000..1e4245fbaf
Binary files /dev/null and b/docs/image/databricks/image9.png differ
diff --git a/docs/setup/databricks.md b/docs/setup/databricks.md
index 75bdc1beb7..6a84cd3a19 100644
--- a/docs/setup/databricks.md
+++ b/docs/setup/databricks.md
@@ -17,18 +17,29 @@
under the License.
-->
-In Databricks advanced editions, you need to install Sedona via [cluster init-scripts](https://docs.databricks.com/clusters/init-scripts.html) as described below. Sedona is not guaranteed to be 100% compatible with `Databricks photon acceleration`. Sedona requires Spark internal APIs to inject many optimization strategies, which sometimes is not accessible in `Photon`.
+You can run Sedona in Databricks to leverage the functionality that Sedona provides. Here’s an example of a Databricks notebook that’s running Sedona code:
-The following steps use DBR including Apache Spark 3.5.x as an example. Please change the Spark version according to your DBR version. Please pay attention to the Spark version postfix and Scala version postfix on our [Maven Coordinate page](maven-coordinates.md). Databricks Spark and Apache Spark's compatibility can be found [here](https://docs.databricks.com/en/release-notes/runtime/index.html).
+
-!!! bug
-    Databricks Runtime 16.2 (non-LTS) introduces a change in the json4s dependency, which may lead to compatibility issues with Apache Sedona. We recommend using a currently supported LTS version, such as Databricks Runtime 15.4 LTS or 14.3 LTS, to ensure stability. A patch will be provided once an official Databricks Runtime 16 LTS version is released.
+Sedona isn’t available in all Databricks environments because of the platform's limitations. This page explains how and where you can run Sedona in Databricks.
-### Download Sedona jars
+## Databricks and Sedona version requirements
-Download the Sedona jars to a DBFS location. You can do that manually via UI or from a notebook by executing this code in a cell:
+Databricks and Sedona depend on Spark, Scala, and other libraries.
-```bash
+For example, Databricks Runtime 16.4 depends on Scala 2.12 and Spark 3.5. Here are the version requirements for a few Databricks runtimes.
+
+
+
+If you use a Databricks Runtime compiled with Spark 3.5 and Scala 2.12, then you should use a Sedona version compiled with Spark 3.5 and Scala 2.12. You need to make sure the Scala versions are aligned, even if you’re using the Python or SQL APIs.
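+
+As a quick sanity check before choosing a JAR, you can print the cluster's Spark and Scala versions from a notebook cell. This is a minimal sketch; the `_jvm` gateway access assumes a standard (non-serverless) cluster:
+
+```python
+# Spark version of the running cluster, e.g. "3.5.2"
+print(spark.version)
+
+# Scala version the runtime was built with, read via the Py4J gateway
+print(spark.sparkContext._jvm.scala.util.Properties.versionNumberString())
+```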
+
+Only some Sedona functions work when Databricks Photon acceleration is enabled, so consider disabling Photon when using Sedona for better compatibility.
+
+## Install the Sedona library in Databricks
+
+Download the required Sedona packages by executing the following commands:
+
+```sh
%sh
# Create JAR directory for Sedona
mkdir -p /Workspace/Shared/sedona/{{ sedona.current_version }}
@@ -39,16 +50,21 @@ curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/geotools-wrapper-{
curl -o /Workspace/Shared/sedona/{{ sedona.current_version }}/sedona-spark-shaded-3.5_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.5_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.5_2.12-{{ sedona.current_version }}.jar"
```
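+
+To confirm that both JARs landed where the init script will look for them, you can list the directory from a notebook. A small check, assuming the Workspace path used above:
+
+```python
+# Expect the geotools-wrapper and sedona-spark-shaded JARs to be listed.
+import os
+print(os.listdir("/Workspace/Shared/sedona/{{ sedona.current_version }}"))
+```
+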
-Of course, you can also do the steps above manually.
+Here are the software versions used to compile `sedona-spark-shaded-3.5_2.12-1.7.1.jar`:
+
+* Spark 3.5
+* Scala 2.12
+* Sedona 1.7.1
+
+Ensure that you use a Databricks Runtime with versions compatible with this JAR.
-### Create an init script
+You will be able to see these in your Databricks environment after downloading them:
-!!!note
-    If you are creating a Shared cluster, you won't be able to use init scripts and jars stored under `Workspace`. Please instead store them in `Volumes`. The overall process should be the same.
+
-Create an init script in `Workspace` that loads the Sedona jars into the cluster's default jar directory. You can create that from any notebook by running:
+Create an init script as follows:
-```bash
+```sh
%sh
# Create init script directory for Sedona
@@ -62,16 +78,26 @@ cat > /Workspace/Shared/sedona/sedona-init.sh <<'EOF'
#
# On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
-cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
+cp /Workspace/Shared/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
EOF
```
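+
+Optionally, confirm the init script was written as expected. A minimal check, assuming the path used above and a runtime that mounts `/Workspace`:
+
+```python
+# Print the init script so you can eyeball the copy command.
+print(open("/Workspace/Shared/sedona/sedona-init.sh").read())
+```
+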
-Of course, you can also do the steps above manually.
+## Create a Databricks cluster
+
+You need to create a Databricks cluster compatible with the Sedona JAR files. If you use Sedona JAR files compiled with Scala 2.12, you must use a Databricks cluster that runs Scala 2.12.
+
+Databricks Photon is only partially compatible with Apache Sedona, so you will have better compatibility if you deselect the Photon option when configuring the cluster.
+
+Go to the Compute tab and configure the cluster:
-### Set up cluster config
+
-From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Spark`) activate the Sedona functions and the kryo serializer by adding to the Spark Config
+Set the proper cluster configurations:
+
+
+
+Here’s a list of the cluster configurations that’s easy to copy and paste:
```
spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
@@ -80,13 +106,17 @@ spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
spark.sedona.enableParserExtension false
```
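+
+Once the cluster is up, you can confirm these settings took effect from a notebook. A sketch that simply echoes the configuration values set above:
+
+```python
+# Each call should return the value entered in the cluster configuration.
+print(spark.conf.get("spark.sql.extensions"))
+print(spark.conf.get("spark.serializer"))
+print(spark.conf.get("spark.kryo.registrator"))
+```
+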
-From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Init Scripts`) add the newly created `Workspace` init script
+Specify the path to the init script:
+
+
-| Type | File path |
-|------|-----------|
-| Workspace | /Shared/sedona/sedona-init.sh |
+If you are creating a Shared cluster, you won't be able to use init scripts and jars stored under Workspace. Please store them in Volumes instead. The overall process should be the same, as in the sketch below.
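+
+For example, you could copy the JARs and the init script from Workspace into a Unity Catalog volume. A hypothetical sketch; the catalog, schema, and volume names are placeholders to replace with your own:
+
+```python
+# Recursively copy the Sedona folder into a volume for Shared clusters.
+dbutils.fs.cp(
+    "file:/Workspace/Shared/sedona/",
+    "/Volumes/your_catalog/your_schema/sedona/",
+    recurse=True,
+)
+```
+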
-For enabling python support, from the Libraries tab install from PyPI
+Add the required dependencies in the Libraries tab:
+
+
+
+Here’s the full list of libraries:
```
apache-sedona=={{ sedona.current_version }}
@@ -95,15 +125,43 @@ keplergl==0.3.7
pydeck==0.9.1
```
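+
+After the cluster restarts with these libraries attached, you can sanity-check the Python installation. A small sketch using only the standard library:
+
+```python
+# Confirms the PyPI package resolved to the version you pinned.
+import importlib.metadata as md
+print(md.version("apache-sedona"))
+```
+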
-!!!tips
-    You need to install the Sedona libraries via init script because the libraries installed via UI are installed after the cluster has already started, and therefore the classes specified by the config `spark.sql.extensions`, `spark.serializer`, and `spark.kryo.registrator` are not available at startup time.*
+Then click “Create compute” to start the cluster.
+
+## Create a Databricks notebook
+
+Create a Databricks notebook and connect it to the cluster. Verify that you can run a Python computation with a Sedona function:
-### Verify installation
+
-After you have started the cluster, you can verify that Sedona is correctly installed by running the following code in a notebook:
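+
+The verification amounts to running a single Sedona function from Python. A minimal sketch, assuming the `apache-sedona` package from the Libraries tab is installed:
+
+```python
+# Create the Sedona-enabled session and run one spatial function.
+from sedona.spark import SedonaContext
+
+sedona = SedonaContext.create(spark)
+sedona.sql("SELECT ST_Point(1.0, 2.0) AS point").show()
+```
+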
+You can also use the SQL API as follows:
+
+
+
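+A minimal SQL-flavored sketch of the same check, assuming the `sedona` session from the previous cell:
+
+```python
+# The SQL API exposes the same ST_ functions.
+sedona.sql("SELECT ST_AsText(ST_Point(1.0, 2.0)) AS wkt").show()
+```
+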
+## Saving geometry in Databricks Delta Lake tables
+
+Here’s how to create a Sedona DataFrame with a geometry column:
```python
-spark.sql("SELECT ST_Point(1, 1)").show()
+from pyspark.sql.functions import expr
+
+df = sedona.createDataFrame([
+    ('a', 'POLYGON((1.0 1.0,1.0 3.0,2.0 3.0,2.0 1.0,1.0 1.0))'),
+    ('b', 'LINESTRING(4.0 1.0,4.0 2.0,6.0 4.0)'),
+    ('c', 'POINT(9.0 2.0)'),
+], ["id", "geometry"])
+df = df.withColumn("geometry", expr("ST_GeomFromWKT(geometry)"))
```
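+
+A quick look at the result; the WKT strings should now be geometry objects rather than plain strings:
+
+```python
+df.show(truncate=False)
+df.printSchema()  # the geometry column should have Sedona's geometry type
+```
+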
-Note that: you don't need to run the `SedonaRegistrator.registerAll(spark)` or `SedonaContext.create(spark)` in the advanced edition because `org.apache.sedona.sql.SedonaSqlExtensions` in the Cluster Config will take care of that.
+Write the Sedona DataFrame to a Delta Lake table:
+
+```python
+df.write.saveAsTable("your_org.default.geotable")
+```
+
+Here’s how to read the table: `sedona.table("your_org.default.geotable").display()`
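+
+You can also confirm the geometry type round-trips through Delta Lake. A small sketch, reusing the table name from above:
+
+```python
+# The geometry column should come back as a geometry, not raw WKT or binary.
+sedona.table("your_org.default.geotable").printSchema()
+```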
+
+This is what the results look like in Databricks:
+
+
+
+## Known bugs
+
+To ensure stability, we recommend using a currently supported Long-Term Support (LTS) version, such as Databricks Runtime 16.4 LTS or 15.4 LTS. Some Databricks Runtimes, such as 16.2 (non-LTS), are not compatible with Apache Sedona, as this particular runtime introduced a change in the json4s dependency.