[incubator-sedona] branch master updated: [DOCS] Update Databricks documentation (#563)

jiayu Fri, 12 Nov 2021 15:07:39 -0800

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-sedona.git



The following commit(s) were added to refs/heads/master by this push:
     new df67799  [DOCS] Update Databricks documentation (#563)
df67799 is described below

commit df67799ea5d3372ab267ebe670dba08d800e6b54
Author: Erni Durdevic <[email protected]>
AuthorDate: Sat Nov 13 00:06:41 2021 +0100

    [DOCS] Update Databricks documentation (#563)
---
 docs/download/databricks.md | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/docs/download/databricks.md b/docs/download/databricks.md
index 9a1b0d5..c97a577 100644
--- a/docs/download/databricks.md
+++ b/docs/download/databricks.md
@@ -4,14 +4,16 @@ You just need to install the Sedona jars and Sedona Python on 
Databricks using D
 
 ## Advanced editions
 
-### Databricks DBR 7.x (Recommended)
+
+### Databricks DBR 7.x
 
 If you are using the commercial version of Databricks up to version 7.x you 
can install the Sedona jars and Sedona Python using the Databricks default web 
UI and everything should work.
 
 ### Databricks DBR 8.x, 9.x, 10.x
 
-If you are not using the free version of Databricks, there are currently some 
compatibility issues with DBR 8.x+. Specifically, the `ST_intersect` join query 
with the DataFrame API will throw a `java.lang.NoSuchMethodError` exception. As 
a temporary solution you can mix your DataFrame API with RDD API to perform 
spatial join queries (See 
[example](https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb)).
-
+If you are using the commercial version of Databricks with DBR 8.x+:
+* You need to use Sedona version `1.1.1-incubating` or higher.
+* To activate the Kryo serializer (which speeds up the serialization and 
deserialization of geometry types), you need to install the libraries via 
init script as described below.
 
 ## Install Sedona from the web UI
 
@@ -26,14 +28,14 @@ If you are not using the free version of Databricks, there 
are currently some co
     apache-sedona
     ```
 
-3) (Optional) You can speed up the serialization of geometry types by adding 
to your spark configurations (`Cluster` -> `Edit` -> `Configuration` -> 
`Advanced options`) the following lines:
+3) (For DBR up to 7.3 LTS) You can speed up the serialization of geometry 
types by adding to your spark configurations (`Cluster` -> `Edit` -> 
`Configuration` -> `Advanced options`) the following lines:
 
     ```
     spark.serializer org.apache.spark.serializer.KryoSerializer
     spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
     ```
 
-    *This options are not compatible with the commercial Databricks DBR 
versions (8.x+).*
+    To activate these options for DBR versions 8.x+, you need to 
install the Sedona libraries via init script, because libraries installed via UI 
are not yet available at cluster startup, when these options are registered.
 
 ## Initialise
 
@@ -42,7 +44,7 @@ After you have installed the libraries and started the 
cluster, you can initiali
 (scala)
 ```Scala
 import org.apache.sedona.sql.utils.SedonaSQLRegistrator
-SedonaSQLRegistrator.registerAll(sparkSession)
+SedonaSQLRegistrator.registerAll(spark)
 ```
 
 (or python)
@@ -55,19 +57,21 @@ SedonaRegistrator.registerAll(spark)
  
 In order to use the Sedona `ST_*` functions from SQL without having to 
register the Sedona functions from a python/scala cell, you need to install the 
sedona libraries from the [cluster 
init-scripts](https://docs.databricks.com/clusters/init-scripts.html) as 
follows.
 
+## Install Sedona via init script
+
 Download the Sedona jars to a DBFS location. You can do that manually via UI 
or from a notebook with
 
 ```bash
 %sh 
 # Create JAR directory for Sedona
-mkdir -p /dbfs/jars/sedona/{{ sedona.current_version }}
+mkdir -p /dbfs/FileStore/jars/sedona/{{ sedona.current_version }}
 
 # Download the dependencies from Maven into DBFS
-curl -o /dbfs/jars/sedona/{{ sedona.current_version 
}}/geotools-wrapper-geotools-{{ sedona.current_geotools }}.jar 
"https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/geotools-{{ 
sedona.current_geotools }}/geotools-wrapper-geotools-{{ sedona.current_geotools 
}}.jar"
+curl -o /dbfs/FileStore/jars/sedona/{{ sedona.current_version 
}}/geotools-wrapper-geotools-{{ sedona.current_geotools }}.jar 
"https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/geotools-{{ 
sedona.current_geotools }}/geotools-wrapper-geotools-{{ sedona.current_geotools 
}}.jar"
 
-curl -o /dbfs/jars/sedona/{{ sedona.current_version 
}}/sedona-python-adapter-3.0_2.12-{{ sedona.current_version }}.jar 
"https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/{{
 sedona.current_version }}/sedona-python-adapter-3.0_2.{{ 
sedona.current_version }}.jar"
+curl -o /dbfs/FileStore/jars/sedona/{{ sedona.current_version 
}}/sedona-python-adapter-3.0_2.12-{{ sedona.current_version }}.jar 
"https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/{{
 sedona.current_version }}/sedona-python-adapter-3.0_2.12-{{ 
sedona.current_version }}.jar"
 
-curl -o /dbfs/jars/sedona/{{ sedona.current_version }}/sedona-viz-2.4_2.12-{{ 
sedona.current_version }}.jar 
"https://repo1.maven.org/maven2/org/apache/sedona/sedona-viz-2.4_2.12/{{ 
sedona.current_version }}/sedona-viz-2.4_2.12-{{ sedona.current_version }}.jar"
+curl -o /dbfs/FileStore/jars/sedona/{{ sedona.current_version 
}}/sedona-viz-2.4_2.12-{{ sedona.current_version }}.jar 
"https://repo1.maven.org/maven2/org/apache/sedona/sedona-viz-2.4_2.12/{{ 
sedona.current_version }}/sedona-viz-2.4_2.12-{{ sedona.current_version }}.jar"
 ```
 
 Create an init script in DBFS that loads the Sedona jars into the cluster's 
default jar directory. You can create that from any notebook by running: 
@@ -76,10 +80,10 @@ Create an init script in DBFS that loads the Sedona jars 
into the cluster's defa
 %sh 
 
 # Create init script directory for Sedona
-mkdir -p /dbfs/sedona/
+mkdir -p /dbfs/FileStore/sedona/
 
 # Create init script
-cat > /dbfs/sedona/sedona-init.sh <<'EOF'
+cat > /dbfs/FileStore/sedona/sedona-init.sh <<'EOF'
 #!/bin/bash
 #
 # File: sedona-init.sh
@@ -89,20 +93,22 @@ cat > /dbfs/sedona/sedona-init.sh <<'EOF'
 # On cluster startup, this script will copy the Sedona jars to the cluster's 
default jar directory.
 # In order to activate Sedona functions, remember to add to your spark 
configuration the Sedona extensions: "spark.sql.extensions 
org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
 
-cp /dbfs/jars/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
+cp /dbfs/FileStore/jars/sedona/{{ sedona.current_version }}/*.jar 
/databricks/jars
 
 EOF
 ```
 
-From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> 
`Advanced options` -> `Spark`) activate the Sedona functions by adding to the 
Spark Config 
+From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> 
`Advanced options` -> `Spark`) activate the Sedona functions and the kryo 
serializer by adding to the Spark Config 
 ```
 spark.sql.extensions 
org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
+spark.serializer org.apache.spark.serializer.KryoSerializer
+spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
 ```
 
 From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> 
`Advanced options` -> `Init Scripts`) add the newly created init script 
 ```
-/dbfs/sedona/sedona-init.sh
+/dbfs/FileStore/sedona/sedona-init.sh
 ```
 
-*Note: You need to install the sedona libraries via init script because the 
libraries installed via UI are installed after the cluster has already started, 
and therefore the classes specified by the config `spark.sql.extensions` are 
not available at startup time.*
+*Note: You need to install the sedona libraries via init script because the 
libraries installed via UI are installed after the cluster has already started, 
and therefore the classes specified by the config `spark.sql.extensions`, 
`spark.serializer`, and `spark.kryo.registrator` are not available at startup 
time.*
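
The note above implies that three Spark settings must already be in place when the cluster starts. As a minimal sketch (not part of the commit), the following Python helper compares a dictionary of Spark settings against the values listed in the Spark Config block of this diff. The names `EXPECTED_SEDONA_CONF` and `missing_sedona_settings` are hypothetical and introduced only for illustration.

```python
# Settings that the init-script section asks you to add to the Spark Config.
# The values are copied from the documentation diff above.
EXPECTED_SEDONA_CONF = {
    "spark.sql.extensions": (
        "org.apache.sedona.viz.sql.SedonaVizExtensions,"
        "org.apache.sedona.sql.SedonaSqlExtensions"
    ),
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
}


def missing_sedona_settings(conf):
    """Return the expected keys that are absent or set differently in `conf`.

    `conf` is a plain dict mapping Spark setting names to string values.
    """
    problems = []
    for key, expected in EXPECTED_SEDONA_CONF.items():
        actual = conf.get(key)
        if key == "spark.sql.extensions":
            # Other extensions may be configured too, in any order, so only
            # check that both Sedona extensions are present in the list.
            have = {e.strip() for e in (actual or "").split(",") if e.strip()}
            want = set(expected.split(","))
            if not want <= have:
                problems.append(key)
        elif actual != expected:
            problems.append(key)
    return problems
```

In a notebook, assuming the usual `spark` session object, the dictionary could be built with `dict(spark.sparkContext.getConf().getAll())` and passed to the helper. An empty result suggests the cluster was started with the settings in place.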
