(sedona) branch master updated: [DOCS] add glue tutorial (#1453)

jiayu Wed, 05 Jun 2024 22:26:12 -0700

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git



The following commit(s) were added to refs/heads/master by this push:
     new 3960db81c [DOCS] add glue tutorial (#1453)
3960db81c is described below

commit 3960db81c6aa4a66f8d9113f8f64ef18b1c1d1ec
Author: James Willis <[email protected]>
AuthorDate: Wed Jun 5 22:26:03 2024 -0700

    [DOCS] add glue tutorial (#1453)
    
    * add glue tutorial
    
    * revise glue tutorial with PR feedback
    
    * add spark/scala warning for picking out the jars
    
    ---------
    
    Co-authored-by: jameswillis <[email protected]>
---
 docs/setup/compile.md |   1 +
 docs/setup/glue.md    | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml            |   1 +
 3 files changed, 122 insertions(+)

diff --git a/docs/setup/compile.md b/docs/setup/compile.md
index 7d2fd8d96..5589874f5 100644
--- a/docs/setup/compile.md
+++ b/docs/setup/compile.md
@@ -149,6 +149,7 @@ pip install mkdocs-material
 pip install mkdocs-macros-plugin
 pip install mkdocs-git-revision-date-localized-plugin
 pip install mike
+pip install pymdown-extensions
 ```
 
 After installing MkDocs and MkDocs-Material, run these commands in the Sedona root folder:
diff --git a/docs/setup/glue.md b/docs/setup/glue.md
new file mode 100644
index 000000000..e13e6b478
--- /dev/null
+++ b/docs/setup/glue.md
@@ -0,0 +1,120 @@
+
+This tutorial will cover how to configure both a glue notebook and a glue ETL job. The tutorial is written assuming you
+have a working knowledge of AWS Glue jobs and S3.
+
+In the tutorial, we use
+Sedona {{ sedona.current_version }} and [Glue 4.0](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html) which runs on Spark 3.3.0, Java 8, Scala 2.12,
+and Python 3.10. We recommend Sedona-1.3.1-incubating and above for Glue.
+
+## Stage Sedona Jar in S3
+
+In an AWS S3 bucket you will need the Sedona-spark-shaded and geotools-wrapper jars. There are two options to get these.
+Ensure the locations of the jars are accessible to your glue job.
+
+!!!note
+    Ensure you pick a version for Scala 2.12 and Spark 3.0. The Spark 3.4 and Scala 2.13 jars are not compatible with
+    Glue 4.0.
+
+### Option 1: Use the Wherobots-hosted jars
+
+[Wherobots](https://wherobots.com/) provides a public S3 bucket with the necessary jars. You can point to these directly in your glue jobs'
+configurations. For {{ sedona.current_version }}, the Sedona and geotools jars are available at the following locations:
+
+* `s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar`
+* `s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar`
+
+### Option 2: Stage your own Jars
+
+In your S3 bucket, add the Sedona Jar from
+[Maven Central](https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.0_2.12/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar).
+
+Similarly, add the Geotools Wrapper Jar
+from [Maven Central](https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/{{ sedona.current_version }}-28.2/geotools-wrapper-{{ sedona.current_version }}-28.2.jar).
+
+!!!note
+    If you use Sedona 1.3.1-incubating, please use the `sedona-python-adapter-3.0_2.12` jar in the content above, instead
+    of `sedona-spark-shaded-3.0_2.12`. Ensure you pick a version for Scala 2.12 and Spark 3.0. The Spark 3.4 and Scala
+    2.13 jars are not compatible with Glue 4.0.
+
+## Configure Glue Job
+
+Once your jars are staged in S3, you can configure your Glue job to use them, as well as the apache-sedona python
+package. How you do this varies slightly between the notebook and the script job types.
+
+!!!note
+    Always ensure that the Sedona version of the jars and the python package match.
+
+### Notebook Job
+
+Add the following cell magics before starting your sparkContext or glueContext. The first points to the jars put in s3,
+and the second installs the Sedona python package directly from pip.
+
+```python
+# Sedona Config
+%extra_jars s3://path/to/my-sedona.jar, s3://path/to/my-geotools.jar
+%additional_python_modules apache-sedona
+```
+
+If using the Wherobots-provided jars, the cell magics would look like this:
+
+```python
+# Sedona Config
+%extra_jars s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar, s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar
+%additional_python_modules apache-sedona=={{ sedona.current_version }}
+```
+
+If you are using the example notebook from glue, the first cell should now look like this:
+
+```python
+%idle_timeout 2880
+%glue_version 4.0
+%worker_type G.1X
+%number_of_workers 5
+
+# Sedona Config
+%extra_jars s3://wherobots-sedona-jars/{{ sedona.current_version }}/sedona-spark-shaded-3.0_2.12-{{ sedona.current_version }}.jar, s3://wherobots-sedona-jars/{{ sedona.current_version }}/geotools-wrapper-{{ sedona.current_version }}-28.2.jar
+%additional_python_modules apache-sedona=={{ sedona.current_version }}
+
+
+import sys
+from awsglue.transforms import *
+from awsglue.utils import getResolvedOptions
+from pyspark.context import SparkContext
+from awsglue.context import GlueContext
+from awsglue.job import Job
+
+sc = SparkContext.getOrCreate()
+glueContext = GlueContext(sc)
+spark = glueContext.spark_session
+job = Job(glueContext)
+```
+
+You can confirm your installation by running the following cell:
+
+```python
+from sedona.spark import *
+
+sedona = SedonaContext.create(spark)
+sedona.sql("SELECT ST_POINT(1., 2.) as geom").show()
+```
+
+### ETL Job
+
+Glue also calls these Scripts. From your job's page, navigate to the "Job details" tab. At the bottom of the page expand
+the "Advanced properties" section. In the "Dependent JARs path" field, add the paths to the jars in S3, separated by a comma.
+
+To add the Sedona python package, navigate to the "Job Parameters" section and add a new parameter with the key
+`--additional-python-modules` and the value `apache-sedona=={{ sedona.current_version }}`.
+
+To confirm the installation, add the following code to the script:
+
+```python
+from sedona.spark import *
+
+config = SedonaContext.builder().getOrCreate()
+sedona = SedonaContext.create(config)
+
+sedona.sql("SELECT ST_POINT(1., 2.) as geom").show()
+```
+
+Once added to the script, save and run the job. If the job runs successfully, the installation was successful.
diff --git a/mkdocs.yml b/mkdocs.yml
index 0661cec70..f3060dd72 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -23,6 +23,7 @@ nav:
         - Install on Wherobots: setup/wherobots.md
         - Install on Databricks: setup/databricks.md
         - Install on AWS EMR: setup/emr.md
+        - Install on AWS Glue: setup/glue.md
         - Install on Microsoft Fabric: setup/fabric.md
         - Set up Spark cluster manually: setup/cluster.md
       - Install with Apache Flink:

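Note (not part of the commit): the tutorial added above asks readers to keep the jar versions and the `apache-sedona` package version in sync, and to pass the two jar paths as one comma-separated value. A minimal sketch of deriving all three strings from a single version number might look like the following. The helper names `sedona_glue_jars` and `glue_notebook_magics` are hypothetical, the version `1.6.0` is only an illustrative value, and the sketch does not check that the jars actually exist in the bucket.

```python
# Hypothetical helpers, not part of Sedona or AWS Glue.
# Path layout copied from the tutorial: Spark 3.0 / Scala 2.12 shaded jar
# plus the geotools-wrapper jar with the 28.2 suffix.

def sedona_glue_jars(version, bucket="wherobots-sedona-jars", geotools="28.2"):
    """Return the two S3 jar paths the tutorial lists for a Sedona version."""
    base = f"s3://{bucket}/{version}"
    return [
        f"{base}/sedona-spark-shaded-3.0_2.12-{version}.jar",
        f"{base}/geotools-wrapper-{version}-{geotools}.jar",
    ]


def glue_notebook_magics(version):
    """Build the two cell magics, pinning the Python package to the same version."""
    jars = ", ".join(sedona_glue_jars(version))
    return [
        f"%extra_jars {jars}",
        f"%additional_python_modules apache-sedona=={version}",
    ]


if __name__ == "__main__":
    # "1.6.0" is an assumed example version; substitute the one you use.
    for line in glue_notebook_magics("1.6.0"):
        print(line)
```

For an ETL job, the same comma-joined jar list would go into the "Dependent JARs path" field and `apache-sedona==<version>` into the `--additional-python-modules` parameter, as the tutorial describes.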