jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/sedona.git
The following commit(s) were added to refs/heads/master by this push:
new b288870a96 [GH-2664] GeoParquet writer utilizes geometry SRID to
produce projjson CRS metadata (#2667)
b288870a96 is described below
commit b288870a96c84cb8e94d627cfb5e29791c4ce611
Author: Jia Yu <[email protected]>
AuthorDate: Sat Feb 21 03:27:04 2026 -0700
[GH-2664] GeoParquet writer utilizes geometry SRID to produce projjson CRS
metadata (#2667)
---
docs/tutorial/files/geoparquet-sedona-spark.md | 13 +-
pom.xml | 2 +-
.../geoparquet/GeoParquetMetaData.scala | 28 +++++
.../geoparquet/GeoParquetWriteSupport.scala | 56 ++++++++-
.../org/apache/sedona/sql/geoparquetIOTests.scala | 136 +++++++++++++++++++++
5 files changed, 225 insertions(+), 10 deletions(-)
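
Note: the CRS selection that this patch adds to the writer can be summarised as a small decision function. The sketch below is a simplified, self-contained model of that logic as it reads from the diff, not Sedona's actual code: `CrsField`, `CrsDecision`, `resolve`, and the `toProjJson` callback are hypothetical names introduced for illustration, and the real implementation works on json4s `JValue`s and obtains PROJJSON through proj4sedona.

```scala
// Hypothetical, simplified model of the writer's CRS decision; these are not Sedona classes.
sealed trait CrsField
case object CrsNull extends CrsField // written as "crs": null
case object CrsOmitted extends CrsField // "crs" key left out (readers assume OGC:CRS84)
final case class CrsProjJson(json: String) extends CrsField

object CrsDecision {
  val DefaultSrid = 4326

  // Models the per-column SRID tracking: a single consistent SRID, otherwise None.
  def observedSrid(srids: Seq[Int]): Option[Int] =
    srids.distinct match {
      case Seq(single) => Some(single)
      case _ => None // empty column or mixed SRIDs
    }

  // columnOption:  value of an explicit geoparquet.crs.<column_name> option, if any
  // defaultOption: value of an explicit geoparquet.crs option, if any
  // toProjJson:    assumed SRID-to-PROJJSON lookup that may fail (returns None)
  def resolve(
      columnOption: Option[CrsField],
      defaultOption: Option[CrsField],
      srids: Seq[Int],
      toProjJson: Int => Option[String]): CrsField =
    columnOption.orElse(defaultOption).getOrElse {
      observedSrid(srids) match {
        case Some(DefaultSrid) => CrsOmitted
        case Some(srid) if srid > 0 =>
          // Fall back to null when the SRID cannot be converted.
          toProjJson(srid).map(CrsProjJson(_)).getOrElse(CrsNull)
        case _ => CrsNull // SRID 0, mixed SRIDs, or no geometries
      }
    }
}
```

Explicit options are checked first, so the SRID is consulted only when neither `geoparquet.crs.<column_name>` nor `geoparquet.crs` was supplied, which matches the precedence described in the documentation change below.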
diff --git a/docs/tutorial/files/geoparquet-sedona-spark.md b/docs/tutorial/files/geoparquet-sedona-spark.md
index 833437f577..643d65e467 100644
--- a/docs/tutorial/files/geoparquet-sedona-spark.md
+++ b/docs/tutorial/files/geoparquet-sedona-spark.md
@@ -198,14 +198,19 @@ df.write.format("geoparquet")
The value of `geoparquet.crs` and `geoparquet.crs.<column_name>` can be one of the following:
-* `"null"`: Explicitly setting `crs` field to `null`. This is the default behavior.
+* `"null"`: Explicitly setting `crs` field to `null`. This is the default behavior when geometry SRID is 0.
* `""` (empty string): Omit the `crs` field. This implies that the CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84) for CRS-aware implementations.
* `"{...}"` (PROJJSON string): The `crs` field will be set as the PROJJSON object representing the Coordinate Reference System (CRS) of the geometry. You can find the PROJJSON string of a specific CRS from here: https://epsg.io/ (click the JSON option at the bottom of the page). You can also customize your PROJJSON string as needed.
-Please note that Sedona currently cannot set/get a projjson string to/from a CRS. Its geoparquet reader will ignore the projjson metadata and you will have to set your CRS via [`ST_SetSRID`](../../api/sql/Function.md#st_setsrid) after reading the file.
-Its geoparquet writer will not leverage the SRID field of a geometry so you will have to always set the `geoparquet.crs` option manually when writing the file, if you want to write a meaningful CRS field.
+### Automatic CRS from SRID
-Due to the same reason, Sedona geoparquet reader and writer do NOT check the axis order (lon/lat or lat/lon) and assume they are handled by the users themselves when writing / reading the files. You can always use [`ST_FlipCoordinates`](../../api/sql/Function.md#st_flipcoordinates) to swap the axis order of your geometries.
+When no `geoparquet.crs` option is explicitly provided, Sedona will automatically derive the CRS PROJJSON from the SRID of the geometry column. For example, if all geometries in a column have SRID 32632 (set via [`ST_SetSRID`](../../api/sql/Function.md#st_setsrid)), the writer will automatically produce the PROJJSON for EPSG:32632 in the GeoParquet metadata. For SRID 4326, the CRS field is omitted since this is the GeoParquet default (OGC:CRS84).
+
+* If the SRID is 0 (the default for geometries without an explicit SRID), the `crs` field will be set to `null`.
+* If geometries in a column have mixed SRIDs, the `crs` field defaults to `null`.
+* If an explicit `geoparquet.crs` or `geoparquet.crs.<column_name>` option is provided, it always takes precedence over the SRID-derived CRS.
+
+Sedona geoparquet reader and writer do NOT check the axis order (lon/lat or lat/lon) and assume they are handled by the users themselves when writing / reading the files. You can always use [`ST_FlipCoordinates`](../../api/sql/Function.md#st_flipcoordinates) to swap the axis order of your geometries.
## Save GeoParquet with Covering Metadata
diff --git a/pom.xml b/pom.xml
index 4e8ec472e8..6a113fabd1 100644
--- a/pom.xml
+++ b/pom.xml
@@ -96,7 +96,7 @@
<scala-collection-compat.version>2.5.0</scala-collection-compat.version>
<geoglib.version>1.52</geoglib.version>
<caffeine.version>2.9.2</caffeine.version>
- <proj4sedona.version>0.0.5</proj4sedona.version>
+ <proj4sedona.version>0.0.6</proj4sedona.version>
<geotools.scope>provided</geotools.scope>
<!-- Because it's not in Maven central, make it provided by default -->
diff --git a/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala b/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala
index a108e3bafa..053adf6fc7 100644
--- a/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala
+++ b/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala
@@ -22,6 +22,7 @@ import scala.util.control.NonFatal
import org.apache.spark.sql.types.{DoubleType, FloatType, StructType}
import org.datasyslab.proj4sedona.core.Proj
+import org.datasyslab.proj4sedona.parser.CRSSerializer
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.compactJson
import org.json4s.{DefaultFormats, Extraction, JField, JNothing, JNull, JObject, JValue}
@@ -203,6 +204,33 @@ object GeoParquetMetaData {
}
}
+ /**
+ * Convert an SRID to a PROJJSON JValue using proj4sedona.
+ *
+ * The generated PROJJSON includes an `id` field with the EPSG authority and code, which enables
+ * round-trip SRID preservation when reading the GeoParquet file back.
+ *
+ * @param srid
+ * The SRID to convert (e.g., 4326 for WGS 84).
+ * @return
+ * Some(JValue) containing the PROJJSON if conversion succeeds, None if the SRID is 0
+ * (unknown), 4326 (GeoParquet default CRS), or if conversion fails.
+ */
+ def sridToProjJson(srid: Int): Option[JValue] = {
+ if (srid == 0 || srid == DEFAULT_SRID) return None
+ try {
+ val proj = new Proj("EPSG:" + srid)
+ val projjsonStr = CRSSerializer.toProjJson(proj)
+ if (projjsonStr != null && projjsonStr.nonEmpty) {
+ Some(parse(projjsonStr))
+ } else {
+ None
+ }
+ } catch {
+ case NonFatal(_) => None
+ }
+ }
+
def createCoveringColumnMetadata(coveringColumnName: String, schema: StructType): Covering = {
val coveringColumnIndex = schema.fieldIndex(coveringColumnName)
schema(coveringColumnIndex).dataType match {
diff --git a/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala b/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala
index 48655e5977..ca6f7e090e 100644
--- a/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala
+++ b/spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetWriteSupport.scala
@@ -108,6 +108,7 @@ class GeoParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
private var geoParquetVersion: Option[String] = None
private var defaultGeoParquetCrs: Option[JValue] = None
+ private var userExplicitlySetDefaultCrs: Boolean = false
private val geoParquetColumnCrsMap: mutable.Map[String, Option[JValue]] = mutable.Map.empty
private val geoParquetColumnCoveringMap: mutable.Map[String, Covering] = mutable.Map.empty
private val generatedCoveringColumnOrdinals: mutable.Map[Int, Int] = mutable.Map.empty
@@ -156,11 +157,16 @@ class GeoParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
}
defaultGeoParquetCrs = configuration.get(GEOPARQUET_CRS_KEY) match {
case null =>
- // If no CRS is specified, we write null to the crs metadata field. This is for compatibility with
- // geopandas 0.10.0 and earlier versions, which requires crs field to be present.
+ // If no CRS is specified, we default to deriving CRS from the geometry SRID in finalizeWrite.
+ // This JNull value is used as a fallback when SRID is 0 or SRID-to-PROJJSON conversion fails,
+ // maintaining compatibility with geopandas 0.10.0 and earlier versions, which require a crs field.
Some(org.json4s.JNull)
- case "" => None
- case crs: String => Some(parse(crs))
+ case "" =>
+ userExplicitlySetDefaultCrs = true
+ None
+ case crs: String =>
+ userExplicitlySetDefaultCrs = true
+ Some(parse(crs))
}
geometryColumnInfoMap.keys.map(schema(_).name).foreach { name =>
Option(configuration.get(GEOPARQUET_CRS_KEY + "." + name)).foreach {
@@ -246,7 +252,21 @@ class GeoParquetWriteSupport extends WriteSupport[InternalRow] with Logging {
columnInfo.bbox.maxX,
columnInfo.bbox.maxY)
} else Seq(0.0, 0.0, 0.0, 0.0)
- val crs = geoParquetColumnCrsMap.getOrElse(columnName, defaultGeoParquetCrs)
+ val crs = geoParquetColumnCrsMap.getOrElse(
+ columnName, {
+ if (!userExplicitlySetDefaultCrs) {
+ // No explicit CRS option was provided; try to derive from geometry SRID.
+ // For SRID 4326 (OGC:CRS84), omit CRS entirely per GeoParquet spec default.
+ columnInfo.observedSrid match {
+ case Some(srid) if srid == GeoParquetMetaData.DEFAULT_SRID => None
+ case Some(srid) if srid > 0 =>
+ GeoParquetMetaData.sridToProjJson(srid).orElse(defaultGeoParquetCrs)
+ case _ => defaultGeoParquetCrs
+ }
+ } else {
+ defaultGeoParquetCrs
+ }
+ })
val covering = geoParquetColumnCoveringMap.get(columnName)
columnName -> GeometryFieldMetaData("WKB", geometryTypes, bbox, crs, covering)
}.toMap
@@ -712,6 +732,22 @@ object GeoParquetWriteSupport {
// that are present in the column.
val seenGeometryTypes: mutable.Set[String] = mutable.Set.empty
+ // Track SRIDs seen in geometry values. A consistent SRID can be used to
+ // auto-generate CRS (projjson) metadata when no explicit CRS is provided:
+ // SRID 4326 results in omitted CRS (GeoParquet default), positive non-4326
+ // SRIDs generate PROJJSON, and SRID 0 or mixed SRIDs result in null CRS.
+ private var _srid: Int = -1 // -1 = no geometries seen yet
+ private var _mixedSrids: Boolean = false
+
+ /**
+ * Returns the observed SRID if all geometries had the same SRID, or None if no geometries
+ * were seen or if mixed SRIDs were encountered.
+ */
+ def observedSrid: Option[Int] = {
+ if (_mixedSrids || _srid == -1) None
+ else Some(_srid)
+ }
+
def update(geom: Geometry): Unit = {
bbox.update(geom)
// In case of 3D geometries, a " Z" suffix gets added (e.g. ["Point Z"]).
@@ -721,6 +757,16 @@ object GeoParquetWriteSupport {
}
val geometryType = if (!hasZ) geom.getGeometryType else geom.getGeometryType + " Z"
seenGeometryTypes.add(geometryType)
+
+ // Track SRID consistency across all geometries in this column
+ if (!_mixedSrids) {
+ val geomSrid = geom.getSRID
+ if (_srid == -1) {
+ _srid = geomSrid
+ } else if (_srid != geomSrid) {
+ _mixedSrids = true
+ }
+ }
}
}
diff --git a/spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala b/spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala
index 9f3bc97d27..3041757ed2 100644
--- a/spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala
+++ b/spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala
@@ -572,6 +572,142 @@ class geoparquetIOTests extends TestBaseScala with BeforeAndAfterAll {
}
}
+ it("GeoParquet save should omit CRS for SRID 4326 per GeoParquet default") {
+ val wktReader = new WKTReader()
+ val geom = wktReader.read("POINT (1 2)")
+ geom.setSRID(4326)
+ val testData = Seq(Row(1, geom))
+ val schema = StructType(
+ Seq(
+ StructField("id", IntegerType, nullable = false),
+ StructField("geometry", GeometryUDT(), nullable = false)))
+ val df = sparkSession.createDataFrame(testData.asJava, schema).repartition(1)
+ val geoParquetSavePath = geoparquetoutputlocation + "/gp_srid_4326_omit_crs.parquet"
+ df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ // SRID 4326 = OGC:CRS84, the GeoParquet default. CRS field should be omitted.
+ assert(
+ crs == org.json4s.JNothing,
+ s"Expected omitted CRS for SRID 4326 (GeoParquet default), got $crs")
+ }
+ // Round-trip: read back and verify SRID is preserved (omitted CRS -> 4326)
+ val df2 = sparkSession.read.format("geoparquet").load(geoParquetSavePath)
+ val geoms = df2.select("geometry").collect().map(_.getAs[Geometry](0))
+ geoms.foreach { g =>
+ assert(g.getSRID == 4326, s"Expected SRID 4326 after round-trip, got ${g.getSRID}")
+ }
+ }
+
+ it("GeoParquet save should auto-generate projjson from non-default SRID") {
+ val wktReader = new WKTReader()
+ val geom = wktReader.read("POINT (500000 4649776)")
+ geom.setSRID(32632)
+ val testData = Seq(Row(1, geom))
+ val schema = StructType(
+ Seq(
+ StructField("id", IntegerType, nullable = false),
+ StructField("geometry", GeometryUDT(), nullable = false)))
+ val df = sparkSession.createDataFrame(testData.asJava, schema).repartition(1)
+ val geoParquetSavePath = geoparquetoutputlocation + "/gp_auto_crs_from_srid_32632.parquet"
+ df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ implicit val formats: org.json4s.Formats = org.json4s.DefaultFormats
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ // CRS should be auto-generated from SRID 32632
+ assert(
+ crs.isInstanceOf[org.json4s.JObject],
+ s"Expected JObject for auto-generated CRS, got $crs")
+ val authority = (crs \ "id" \ "authority").extract[String]
+ val code = (crs \ "id" \ "code").extract[Int]
+ assert(authority == "EPSG")
+ assert(code == 32632)
+ }
+ // Round-trip: read back and verify SRID is preserved
+ val df2 = sparkSession.read.format("geoparquet").load(geoParquetSavePath)
+ val geoms = df2.select("geometry").collect().map(_.getAs[Geometry](0))
+ geoms.foreach { g =>
+ assert(g.getSRID == 32632, s"Expected SRID 32632 after round-trip, got ${g.getSRID}")
+ }
+ }
+
+ it("GeoParquet save should keep crs null when geometry SRID is 0") {
+ val wktReader = new WKTReader()
+ val geom = wktReader.read("POINT (1 2)")
+ // SRID defaults to 0
+ assert(geom.getSRID == 0)
+ val testData = Seq(Row(1, geom))
+ val schema = StructType(
+ Seq(
+ StructField("id", IntegerType, nullable = false),
+ StructField("geometry", GeometryUDT(), nullable = false)))
+ val df = sparkSession.createDataFrame(testData.asJava, schema).repartition(1)
+ val geoParquetSavePath = geoparquetoutputlocation + "/gp_srid_zero_crs_null.parquet"
+ df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ assert(crs == org.json4s.JNull, s"Expected null CRS for SRID 0, got $crs")
+ }
+ }
+
+ it("GeoParquet save should use explicit CRS option over SRID-derived CRS") {
+ val wktReader = new WKTReader()
+ val geom = wktReader.read("POINT (1 2)")
+ geom.setSRID(4326)
+ val testData = Seq(Row(1, geom))
+ val schema = StructType(
+ Seq(
+ StructField("id", IntegerType, nullable = false),
+ StructField("geometry", GeometryUDT(), nullable = false)))
+ val df = sparkSession.createDataFrame(testData.asJava, schema).repartition(1)
+ val geoParquetSavePath =
+ geoparquetoutputlocation + "/gp_explicit_crs_overrides_srid.parquet"
+
+ // Explicitly set CRS to null — should override SRID-derived CRS
+ df.write
+ .format("geoparquet")
+ .option("geoparquet.crs", "null")
+ .mode("overwrite")
+ .save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ assert(crs == org.json4s.JNull, s"Expected null CRS when explicitly set, got $crs")
+ }
+
+ // Explicitly omit CRS — should override SRID-derived CRS
+ df.write
+ .format("geoparquet")
+ .option("geoparquet.crs", "")
+ .mode("overwrite")
+ .save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ assert(
+ crs == org.json4s.JNothing,
+ s"Expected omitted CRS when explicitly set to empty, got $crs")
+ }
+ }
+
+ it("GeoParquet save should keep crs null for mixed SRIDs in one column") {
+ val wktReader = new WKTReader()
+ val geom1 = wktReader.read("POINT (1 2)")
+ geom1.setSRID(4326)
+ val geom2 = wktReader.read("POINT (3 4)")
+ geom2.setSRID(32632)
+ val testData = Seq(Row(1, geom1), Row(2, geom2))
+ val schema = StructType(
+ Seq(
+ StructField("id", IntegerType, nullable = false),
+ StructField("geometry", GeometryUDT(), nullable = false)))
+ val df = sparkSession.createDataFrame(testData.asJava, schema).repartition(1)
+ val geoParquetSavePath = geoparquetoutputlocation + "/gp_mixed_srid.parquet"
+ df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ val crs = geo \ "columns" \ "geometry" \ "crs"
+ assert(crs == org.json4s.JNull, s"Expected null CRS for mixed SRIDs, got $crs")
+ }
+ }
+
it("GeoParquet read should set SRID from PROJJSON CRS with EPSG identifier") {
val df =
sparkSession.read.format("geoparquet").load(geoparquetdatalocation4)
val projjson =