jiayuasu commented on code in PR #2954:
URL: https://github.com/apache/sedona/pull/2954#discussion_r3254156011
##########
docs/tutorial/raster.md:
##########
@@ -90,625 +62,618 @@ You can add additional Spark runtime config to the config builder. For example,
import org.apache.sedona.spark.SedonaContext;
SparkSession config = SedonaContext.builder()
- .master("local[*]") // Delete this if run in cluster mode
- .appName("readTestScala") // Change this to a proper name
- .getOrCreate()
- ```
- If you use SedonaViz together with SedonaSQL, please add the following line after `SedonaContext.builder()` to enable Sedona Kryo serializer:
- ```scala
- .config("spark.kryo.registrator", SedonaVizKryoRegistrator.class.getName) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
+ .master("local[*]")
+ .appName("rasterTutorial")
+ .getOrCreate();
+ SparkSession sedona = SedonaContext.create(config);
```
=== "Python"
```python
- from sedona.spark import *
-
- config = SedonaContext.builder() .\
- config('spark.jars.packages',
- 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},'
- 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}'). \
- getOrCreate()
+ from sedona.spark import SedonaContext
+
+ config = (
+ SedonaContext.builder()
+ .config(
+ "spark.jars.packages",
+ "org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},"
+ "org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}",
+ )
+ .getOrCreate()
+ )
+ sedona = SedonaContext.create(config)
```
- Please replace the `3.3` in the package name of sedona-spark-shaded with the corresponding major.minor version of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`.
-
-## Initiate SedonaContext
-
-Add the following line after creating the Sedona config. If you already have a SparkSession (usually named `spark`) created by Wherobots/AWS EMR/Databricks, please call `SedonaContext.create(spark)` instead.
+ Replace `3.3` with the major.minor version of your Spark install (for example `sedona-spark-shaded-3.4_2.12`).
-=== "Scala"
-
- ```scala
- import org.apache.sedona.spark.SedonaContext
+You can also register Sedona by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`.
- val sedona = SedonaContext.create(config)
- ```
+## End-to-end walkthrough
-=== "Java"
+The walkthrough uses a single 2-band GeoTIFF (red and near-infrared reflectance over a small AOI) and carries it through every stage of a typical raster workflow. The scene is synthesized in Python so the example is fully reproducible and ships no extra bytes. The same SQL runs unchanged against real Sentinel-2 chips; only the input path changes.
- ```java
- import org.apache.sedona.spark.SedonaContext;
+
- SparkSession sedona = SedonaContext.create(config)
- ```
+??? example "What real rasters look like"
-=== "Python"
+ The same code paths handle anything the GeoTIFF spec supports. Two examples from Sedona's own test resources:
- ```python
- from sedona.spark import *
+ | 3-band color raster | Single-band raster |
+ | :--- | :--- |
+ |  |
 |
- sedona = SedonaContext.create(config)
- ```
+ `RS_NumBands(rast)` would return `3` and `1` respectively. Band-level functions like `RS_Band(rast, ARRAY(1,2,3))` and `RS_MapAlgebra` work the same way on both.
-You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`.
+### 1. Create the input scene
-## Load GeoTiff data
+Synthesize a 256 × 256 raster with a circular vegetated field. Real workflows skip this step and point Sedona at existing GeoTIFFs on disk or in object storage.
-The recommended way to load GeoTiff raster data is the `raster` data source. It loads GeoTiff files and automatically splits them into smaller tiles. Each tile becomes a row in the resulting DataFrame stored in `Raster` format.
+```python
+import os
+import numpy as np
+import rasterio
+from rasterio.transform import from_bounds
+
+WORK = "/tmp/sedona-raster-tutorial"
+os.makedirs(WORK, exist_ok=True)
+
+AOI = (-91.10, 41.50, -91.00, 41.60) # xmin, ymin, xmax, ymax in EPSG:4326
+W = H = 256
+transform = from_bounds(*AOI, W, H)
+rng = np.random.default_rng(42)
+
+ys, xs = np.mgrid[0:H, 0:W]
+field = ((xs - 96) ** 2 + (ys - 160) ** 2) < 60**2 # circular vegetated field
+
+red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000).astype("uint16")
+nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)
+nir = np.where(field, nir + 4000, nir).astype("uint16")
+
+with rasterio.open(
+ f"{WORK}/scene.tif",
+ "w",
+ driver="GTiff",
+ tiled=True,
+ blockxsize=256,
+ blockysize=256,
+ height=H,
+ width=W,
+ count=2,
+ dtype="uint16",
+ crs="EPSG:4326",
+ transform=transform,
+) as dst:
+ dst.write(red, 1)
+ dst.set_band_description(1, "red")
+ dst.write(nir, 2)
+ dst.set_band_description(2, "nir")
+```
+
+### 2. Load with the `raster` data source
+
+The `raster` data source loads GeoTIFFs and automatically splits each file into tiles. Every tile becomes a row in a DataFrame with a `Raster`-typed column.
=== "Scala"
- ```scala
- var rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
+
+ ```scala
+ val rasterDf = sedona.read.format("raster").load(s"$WORK/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
=== "Java"
- ```java
- Dataset<Row> rasterDf = sedona.read().format("raster").load("/some/path/*.tif");
- rasterDf.createOrReplaceTempView("rasterDf");
- rasterDf.show();
- ```
+
+ ```java
+ Dataset<Row> rasterDf = sedona.read().format("raster").load(WORK + "/scene.tif");
+ rasterDf.createOrReplaceTempView("rasterDf");
+ rasterDf.show();
+ ```
=== "Python"
- ```python
- rasterDf = sedona.read.format("raster").load("/some/path/*.tif")
- rasterDf.createOrReplaceTempView("rasterDf")
- rasterDf.show()
- ```
-The output will look like this:
+ ```python
+ rasterDf = sedona.read.format("raster").load(f"{WORK}/scene.tif")
+ rasterDf.createOrReplaceTempView("rasterDf")
+ rasterDf.show()
+ ```
```
-+--------------------+---+---+----+
-| rast| x| y|name|
-+--------------------+---+---+----+
-|GridCoverage2D["g...| 0| 0| ...|
-|GridCoverage2D["g...| 1| 0| ...|
-|GridCoverage2D["g...| 2| 0| ...|
-...
++--------------------+---+---+----------+
+| rast| x| y| name|
++--------------------+---+---+----------+
+|GridCoverage2D["g...| 0| 0| scene.tif|
++--------------------+---+---+----------+
```
-The output contains the following columns:
+The columns are:
-- `rast`: The raster data in `Raster` format.
-- `x`: The 0-based x-coordinate of the tile. This column is only present when retile is not disabled.
-- `y`: The 0-based y-coordinate of the tile. This column is only present when retile is not disabled.
-- `name`: The name of the raster file.
+- `rast` — the raster, in Sedona's `Raster` type.
+- `x`, `y` — the 0-based tile index inside the source file (present when
tiling is enabled).
+- `name` — the source filename.
-### Tiling options
+The 256 × 256 scene fits in a single tile here, so you get one row. A multi-gigabyte GeoTIFF would yield many rows; the same downstream SQL works in both cases.
-By default, tiling is enabled (`retile = true`) and the tile size is determined by the GeoTiff file's internal tiling scheme — you do not need to specify `tileWidth` or `tileHeight`. It is recommended to use [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/) format for raster data since they usually organize pixel data as square tiles.
+
-You can optionally override the tile size, or disable tiling entirely:
+See [Loading options](#loading-options) below for tile-size overrides, recursive directory globs, and non-GeoTIFF formats such as NetCDF and Arc Grid.
-| Option | Default | Description |
-| :--- | :--- | :--- |
-| `retile` | `true` | Whether to enable tiling. Set to `false` to load the entire raster as a single row. |
-| `tileWidth` | GeoTiff's internal tile width | Optional. Override the width of each tile in pixels. |
-| `tileHeight` | Same as `tileWidth` if set, otherwise GeoTiff's internal tile height | Optional. Override the height of each tile in pixels. |
-| `padWithNoData` | `false` | Pad the right and bottom tiles with NODATA values if they are smaller than the specified tile size. |
-
-To override the tile size:
+### 3. Inspect metadata
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("tileWidth", "256")
- .option("tileHeight", "256")
- .load("/some/path/*.tif")
- )
- ```
-
-!!!note
- If the internal tiling scheme of raster data is not friendly for tiling, the `raster` data source will throw an error, and you can disable automatic tiling using `option("retile", "false")`, or specify the tile size manually to workaround this issue. A better solution is to translate the raster data into COG format using `gdal_translate` or other tools.
-
-### Loading raster files from directories
-
-The `raster` data source also works with Spark generic file source options, such as `option("pathGlobFilter", "*.tif*")` and `option("recursiveFileLookup", "true")`. For instance, you can load all the `.tif` files recursively in a directory using:
-
-=== "Python"
- ```python
- rasterDf = (
- sedona.read.format("raster")
- .option("recursiveFileLookup", "true")
- .option("pathGlobFilter", "*.tif*")
- .load(path_to_raster_data_folder)
- )
- ```
+Confirm pixel dimensions, georeference, and CRS before processing:
-!!!tip
- When the loaded path ends with `/`, the `raster` data source will look up raster files in the directory and all its subdirectories recursively. This is equivalent to specifying a path without trailing `/` and setting `option("recursiveFileLookup", "true")`.
+```python
+sedona.sql("""
+ SELECT RS_Width(rast) AS width,
+ RS_Height(rast) AS height,
+ RS_NumBands(rast) AS bands,
+ RS_SRID(rast) AS srid,
+ RS_GeoReference(rast) AS world_file
+ FROM rasterDf
+""").show(truncate=False)
+```
Review Comment:
The RS_GeoReference doc page (linked from the metadata reference table)
already has a full callout explaining the `\n`-escaping behavior and the
`collect()` workaround. Repeating it inline here would duplicate maintenance
for a side concern that doesn't affect understanding the walkthrough's main
flow. The escaped output in the example is consistent with what `show()`
actually produces, which is what readers will see if they run the snippet.
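
For readers who want the multi-line form anyway, here is a minimal sketch of what the `collect()` route gives you, written in plain Python so it runs without a Spark session. The `sample` string is a hypothetical stand-in for the value returned for the tutorial scene; the exact digits and whitespace Sedona emits may differ. The coefficient order (scale X, skew Y, skew X, scale Y, upper-left X, upper-left Y) follows the usual world-file layout and is assumed here to match the default output. `parse_world_file` is an illustrative helper, not a Sedona API.

```python
def parse_world_file(text):
    """Split world-file text into six named affine coefficients."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 coefficients, got {len(values)}")
    # Assumed order: the conventional world-file layout.
    keys = ("scale_x", "skew_y", "skew_x", "scale_y",
            "upper_left_x", "upper_left_y")
    return dict(zip(keys, values))


# Hypothetical stand-in for the string RS_GeoReference returns for the
# 256 x 256 scene over AOI (-91.10, 41.50, -91.00, 41.60). With a live
# session the value would come from something like:
#   sedona.sql("SELECT RS_GeoReference(rast) AS world_file FROM rasterDf")
#         .collect()[0]["world_file"]
sample = "0.000391 \n0.000000 \n0.000000 \n-0.000391 \n-91.100000 \n41.600000"

print(repr(sample))  # single line with escaped newlines, like a show() cell
print(sample)        # six separate lines, as seen after collect()

geo = parse_world_file(sample)
print(geo["scale_x"], geo["upper_left_x"], geo["upper_left_y"])
```

This belongs on the RS_GeoReference page, not in the walkthrough, which is the reason for leaving the tutorial snippet as it is.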