Re: [PR] [GH-2804] Raster tutorial: end-to-end running example with visuals [sedona]

via GitHub Sat, 16 May 2026 23:42:11 -0700


jiayuasu commented on code in PR #2954:
URL: https://github.com/apache/sedona/pull/2954#discussion_r3254156011



##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config 
builder. For example,
        import org.apache.sedona.spark.SedonaContext;
 
        SparkSession config = SedonaContext.builder()
-       .master("local[*]") // Delete this if run in cluster mode
-       .appName("readTestScala") // Change this to a proper name
-       .getOrCreate()
-       ```
-       If you use SedonaViz together with SedonaSQL, please add the following 
line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
-       ```scala
-       .config("spark.kryo.registrator", SedonaVizKryoRegistrator.class.getName) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+         .master("local[*]")
+         .appName("rasterTutorial")
+         .getOrCreate();
+       SparkSession sedona = SedonaContext.create(config);
        ```
 
 === "Python"
 
        ```python
-       from sedona.spark import *
-
-       config = SedonaContext.builder() .\
-           config('spark.jars.packages',
-                  'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},'
-                  'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}'). \
-           getOrCreate()
+       from sedona.spark import SedonaContext
+
+       config = (
+           SedonaContext.builder()
+           .config(
+               "spark.jars.packages",
+               "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},"
+               "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+           )
+           .getOrCreate()
+       )
+       sedona = SedonaContext.create(config)
        ```
-    Please replace the `3.3` in the package name of sedona-spark-shaded with 
the corresponding major.minor version of Spark, such as 
`sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a 
SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, 
please call `SedonaContext.create(spark)` instead.
+       Replace `3.3` with the major.minor version of your Spark install (for 
example `sedona-spark-shaded-3.4_2.12`).
 
-=== "Scala"
-
-       ```scala
-       import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf 
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to 
`spark-submit` or `spark-shell`.
 
-       val sedona = SedonaContext.create(config)
-       ```
+## End-to-end walkthrough
 
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF — red and near-infrared 
reflectance over a small AOI — and carries it through every stage of a typical 
raster workflow. The scene is synthesized in Python so the example is fully 
reproducible and ships no extra bytes. The same SQL runs unchanged against real 
Sentinel-2 chips; only the input path changes.
 
-       ```java
-       import org.apache.sedona.spark.SedonaContext;
+![Pipeline overview](../image/raster-tutorial/00-pipeline.png)
 
-       SparkSession sedona = SedonaContext.create(config)
-       ```
+??? example "What real rasters look like"
 
-=== "Python"
+    The same code paths handle anything the GeoTIFF spec supports. Two 
examples from Sedona's own test resources:
 
-       ```python
-       from sedona.spark import *
+    | 3-band color raster | Single-band raster |
+    | :--- | :--- |
+    | ![Multi-band](../image/raster-tutorial/real-rgb.png) | ![Single-band](../image/raster-tutorial/real-singleband.png) |
 
-       sedona = SedonaContext.create(config)
-       ```
+    `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level 
functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same 
way on both.
 
-You can also register everything by passing `--conf 
spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to 
`spark-submit` or `spark-shell`.
+### 1. Create the input scene
 
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows 
skip this step and point Sedona at existing GeoTIFFs on disk or in object 
storage.
 
-The recommended way to load GeoTiff raster data is the `raster` data source. 
It loads GeoTiff files and automatically splits them into smaller tiles. Each 
tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60)  # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2  # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+    f"{WORK}/scene.tif",
+    "w",
+    driver="GTiff",
+    tiled=True,
+    blockxsize=256,
+    blockysize=256,
+    height=H,
+    width=W,
+    count=2,
+    dtype="uint16",
+    crs="EPSG:4326",
+    transform=transform,
+) as dst:
+    dst.write(red, 1)
+    dst.set_band_description(1, "red")
+    dst.write(nir, 2)
+    dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file 
into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed 
column.
 
 === "Scala"
-    ```scala
-    var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
-    rasterDf.createOrReplaceTempView("rasterDf")
-    rasterDf.show()
-    ```
+
+       ```scala
+       val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+       rasterDf.createOrReplaceTempView("rasterDf")
+       rasterDf.show()
+       ```
 
 === "Java"
-    ```java
-    Dataset<Row> rasterDf = sedona.read().format("raster").load("/some/path/*.tif");
-    rasterDf.createOrReplaceTempView("rasterDf");
-    rasterDf.show();
-    ```
+
+       ```java
+       Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK + "/scene.tif");
+       rasterDf.createOrReplaceTempView("rasterDf");
+       rasterDf.show();
+       ```
 
 === "Python"
-    ```python
-    rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
-    rasterDf.createOrReplaceTempView("rasterDf")
-    rasterDf.show()
-    ```
 
-The output will look like this:
+       ```python
+       rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+       rasterDf.createOrReplaceTempView("rasterDf")
+       rasterDf.show()
+       ```
 
 ```
-+--------------------+---+---+----+
-|                rast|  x|  y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...|  0|  0| ...|
-|GridCoverage2D["g...|  1|  0| ...|
-|GridCoverage2D["g...|  2|  0| ...|
-...
++--------------------+---+---+----------+
+|                rast|  x|  y|      name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...|  0|  0| scene.tif|
++--------------------+---+---+----------+
 ```
 
-The output contains the following columns:
+The columns are:
 
-- `rast`: The raster data in `Raster` format.
-- `x`: The 0-based x-coordinate of the tile. This column is only present when 
retile is not disabled.
-- `y`: The 0-based y-coordinate of the tile. This column is only present when 
retile is not disabled.
-- `name`: The name of the raster file.
+- `rast` — the raster, in Sedona's `Raster` type.
+- `x`, `y` — the 0-based tile index inside the source file (present when 
tiling is enabled).
+- `name` — the source filename.
 
-### Tiling options
+The 256 × 256 scene fits in a single tile here, so you get one row. A 
multi-gigabyte GeoTIFF would yield many rows — the same downstream SQL works in 
both cases.
 
-By default, tiling is enabled (`retile = true`) and the tile size is 
determined by the GeoTiff file's internal tiling scheme — you do not need to 
specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized 
GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they 
usually organize pixel data as square tiles.
+![Tiling flow](../image/raster-tutorial/tiling-flow.svg)
 
-You can optionally override the tile size, or disable tiling entirely:
+See [Loading options](#loading-options) below for tile-size overrides, 
recursive directory globs, and non-GeoTIFF formats such as NetCDF and Arc Grid.
 
-| Option | Default | Description |
-| :--- | :--- | :--- |
-| `retile` | `true` | Whether to enable tiling. Set to `false` to load the entire raster as a single row. |
-| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width of each tile in pixels. |
-| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile height | Optional. Override the height of each tile in pixels. |
-| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA values if they are smaller than the specified tile size. |
-
-To override the tile size:
+### 3. Inspect metadata
 
-=== "Python"
-    ```python
-    rasterDf = (
-        sedona.read.format("raster")
-        .option("tileWidth", "256")
-        .option("tileHeight", "256")
-        .load("/some/path/*.tif")
-    )
-    ```
-
-!!!note
-    If the internal tiling scheme of raster data is not friendly for tiling, 
the `raster` data source will throw an error, and you can disable automatic 
tiling using `option("retile", "false")`, or specify the tile size manually to 
workaround this issue. A better solution is to translate the raster data into 
COG format using `gdal_translate` or other tools.
-
-### Loading raster files from directories
-
-The `raster` data source also works with Spark generic file source options, 
such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", 
"true")`. For instance, you can load all the `.tif` files recursively in a 
directory using:
-
-=== "Python"
-    ```python
-    rasterDf = (
-        sedona.read.format("raster")
-        .option("recursiveFileLookup", "true")
-        .option("pathGlobFilter", "*.tif*")
-        .load(path_to_raster_data_folder)
-    )
-    ```
+Confirm pixel dimensions, georeference, and CRS before processing:
 
-!!!tip
-    When the loaded path ends with `/`, the `raster` data source will look up 
raster files in the directory and all its subdirectories recursively. This is 
equivalent to specifying a path without trailing `/` and setting 
`option("recursiveFileLookup", "true")`.
+```python
+sedona.sql("""
+    SELECT RS_Width(rast)       AS width,
+           RS_Height(rast)      AS height,
+           RS_NumBands(rast)    AS bands,
+           RS_SRID(rast)        AS srid,
+           RS_GeoReference(rast) AS world_file
+    FROM rasterDf
+""").show(truncate=False)
+```

Review Comment:
   The RS_GeoReference doc page (linked from the metadata reference table)
already has a full callout explaining the `\n`-escaping behavior and the
`collect()` workaround. Repeating it inline here would mean maintaining the
same explanation in two places, for a side concern that doesn't affect
understanding the walkthrough's main flow. The escaped output in the example
matches what `show()` actually produces, so it is what readers will see if
they run the snippet.
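
For anyone reading this thread without the doc page open, the difference can be sketched roughly as below. This is a Spark-free illustration, not the text of the callout: `world_file_text` is a hypothetical stand-in for the string `collect()` would return for the tutorial scene (the exact number formatting Sedona emits is an assumption), `parse_world_file` is an illustrative helper rather than a Sedona API, and the six-line ordering assumes the default GDAL format.

```python
# Hypothetical stand-in for the value collect() would return for the
# world_file column. The numbers follow from the tutorial's AOI and its
# 256 x 256 grid; the exact formatting Sedona emits is an assumption.
world_file_text = "0.000390625\n0.0\n0.0\n-0.000390625\n-91.1\n41.6"

# show() renders each newline as a literal backslash followed by "n",
# so the whole value appears on one line in the table.
escaped = world_file_text.replace("\n", "\\n")


def parse_world_file(text):
    """Split six-line world-file text into named float parameters.

    Assumes GDAL ordering: scaleX, skewY, skewX, scaleY,
    upperLeftX, upperLeftY.
    """
    names = (
        "scale_x", "skew_y", "skew_x",
        "scale_y", "upper_left_x", "upper_left_y",
    )
    values = [float(line) for line in text.splitlines()]
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} lines, got {len(values)}")
    return dict(zip(names, values))


params = parse_world_file(world_file_text)
print(escaped)                       # single line, as show() would display it
print(params["scale_x"], params["upper_left_y"])
```

Working from the collected string keeps the real newlines, so the six parameters can be split apart directly instead of being read out of the escaped one-line form.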



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
