This is an automated email from the ASF dual-hosted git repository.
jiayu pushed a commit to branch prepare-1.4.0-doc
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/prepare-1.4.0-doc by this push:
new 95d4f074 Add EMR tutorial
95d4f074 is described below
commit 95d4f07475ba9220b5b0fb75784a6e26096940ef
Author: Jia Yu <[email protected]>
AuthorDate: Wed Mar 15 16:09:43 2023 -0700
Add EMR tutorial
---
docs/setup/emr.md | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
mkdocs.yml | 1 +
2 files changed, 49 insertions(+)
diff --git a/docs/setup/emr.md b/docs/setup/emr.md
new file mode 100644
index 00000000..f9e24ae7
--- /dev/null
+++ b/docs/setup/emr.md
@@ -0,0 +1,48 @@
+We recommend Sedona 1.3.1-incubating or later for EMR. This tutorial uses AWS Elastic MapReduce (EMR) 6.9.0, which has the following applications installed: Hadoop 3.3.3, JupyterEnterpriseGateway 2.6.0, Livy 0.7.1, and Spark 3.3.0.
+
+This tutorial was tested on EMR on EC2 with EMR Studio (notebooks). EMR on EC2 uses YARN to manage resources.
+
+## Prepare initialization script
+
+In your S3 bucket, add a script that has the following content:
+
+```bash
+#!/bin/bash
+
+# EMR clusters only have ephemeral local storage. It does not really matter where we store the jars.
+sudo mkdir /jars
+
+# Download Sedona jar
+sudo curl -o /jars/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.0_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar"
+
+# Download GeoTools jar
+sudo curl -o /jars/geotools-wrapper-{{ sedona.current_geotools }}.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/{{ sedona.current_geotools }}/geotools-wrapper-{{ sedona.current_geotools }}.jar"
+
+# Install necessary python libraries
+sudo python3 -m pip install pandas shapely==1.8.5
+sudo python3 -m pip install pandas geopandas==0.10.2
+sudo python3 -m pip install attrs matplotlib descartes apache-sedona==1.4.0
+```
+
+When you create an EMR cluster, specify the location of this script in the `bootstrap action`.
+
+## Add software configuration
+
+When you create an EMR cluster, add the following content to the software configuration:
+
+```json
+[
+  {
+    "Classification":"spark-defaults",
+    "Properties":{
+      "spark.yarn.dist.jars": "/jars/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar,/jars/geotools-wrapper-{{ sedona.current_geotools }}.jar",
+      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
+      "spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
+      "spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
+    }
+  }
+]
+```
+
+!!!note
+    If you use Sedona 1.3.1-incubating, use the `sedona-python-adapter-3.0_2.12` jar in the content above instead of `sedona-spark-shaded-3.0_2.12`.
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index 2e41e4b2..b5981fcb 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -18,6 +18,7 @@ nav:
- Install Sedona R: api/rdocs
- Install Sedona-Zeppelin: setup/zeppelin.md
- Install on Databricks: setup/databricks.md
+ - Install on AWS EMR: setup/emr.md
- Set up Spark cluster: setup/cluster.md
- Install with Apache Flink:
- Install Sedona Scala/Java: setup/flink/install-scala.md