Re: [PR] [GH-2886] Recognize Box2D columns as GeoParquet bbox covering columns [sedona]

via GitHub Thu, 07 May 2026 21:35:51 -0700


jiayuasu commented on code in PR #2921:
URL: https://github.com/apache/sedona/pull/2921#discussion_r3206296722



##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala:
##########
@@ -236,6 +237,14 @@ object GeoParquetMetaData {
     schema(coveringColumnIndex).dataType match {
       case coveringColumnType: StructType =>
         coveringColumnTypeToCovering(coveringColumnName, coveringColumnType)
+      case _: Box2DUDT =>
+        // Box2DUDT exposes a struct<xmin, ymin, xmax, ymax: double> sqlType, 
which is the exact
+        // shape required by GeoParquet 1.1 bbox covering columns. Treat the 
underlying struct as
+        // the covering struct so users can write a Box2D column and have it 
referenced as a
+        // covering column in GeoParquet metadata without any manual struct 
construction.
+        coveringColumnTypeToCovering(
+          coveringColumnName,
+          new Box2DUDT().sqlType.asInstanceOf[StructType])
       case _ =>

Review Comment:
   Done in 776c5b93 — bound the case to `udt: Box2DUDT` and use `udt.sqlType` 
with a `match` on `StructType` (no allocation, no `asInstanceOf`, clear failure 
mode if the sqlType shape ever changes).
   
   On the second suggestion (generalize to any `UserDefinedType` whose 
`sqlType` is a `StructType`): I would push back on that. Other UDTs may have 
struct-shaped `sqlType`s that are not valid bbox covering columns — e.g., a 
future ImageUDT or RasterMetadataUDT could have a struct of metadata fields, 
and the call to `validateField("xmin")` inside `coveringColumnTypeToCovering` 
would throw a confusing error instead of the clear "not a struct type" one. 
Keeping the match specific to `Box2DUDT` makes the contract explicit; we can 
add other UDTs case-by-case as they appear.



##########
spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala:
##########
@@ -1021,6 +1021,53 @@ class geoparquetIOTests extends TestBaseScala with 
BeforeAndAfterAll {
       }
     }
 
+    it("GeoParquet supports writing covering metadata from a Box2D column") {
+      // User-provided Box2D column referenced via the geoparquet.covering 
option.
+      val df = sparkSession
+        .range(0, 100)
+        .toDF("id")
+        .withColumn("id", expr("CAST(id AS DOUBLE)"))
+        .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+        .withColumn("test_cov", expr("ST_Box2D(geometry)"))
+      val geoParquetSavePath = geoparquetoutputlocation + 
"/gp_with_box2d_covering.parquet"
+      df.write
+        .format("geoparquet")
+        .option("geoparquet.covering.geometry", "test_cov")
+        .mode("overwrite")
+        .save(geoParquetSavePath)

Review Comment:
   The hard-coded path under `geoparquetoutputlocation` is the existing 
convention for the GeoParquet covering tests in this file (line 992 onward, 
`gp_with_covering_metadata.parquet` etc.). Using a different pattern just for 
the new tests would be inconsistent. If we want to move the suite to per-test 
unique dirs, that should be a focused refactor across all of these tests rather 
than a one-off here.



##########
spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala:
##########
@@ -1021,6 +1021,53 @@ class geoparquetIOTests extends TestBaseScala with 
BeforeAndAfterAll {
       }
     }
 
+    it("GeoParquet supports writing covering metadata from a Box2D column") {
+      // User-provided Box2D column referenced via the geoparquet.covering 
option.
+      val df = sparkSession
+        .range(0, 100)
+        .toDF("id")
+        .withColumn("id", expr("CAST(id AS DOUBLE)"))
+        .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+        .withColumn("test_cov", expr("ST_Box2D(geometry)"))
+      val geoParquetSavePath = geoparquetoutputlocation + 
"/gp_with_box2d_covering.parquet"
+      df.write
+        .format("geoparquet")
+        .option("geoparquet.covering.geometry", "test_cov")
+        .mode("overwrite")
+        .save(geoParquetSavePath)
+      validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+        implicit val formats: org.json4s.Formats = org.json4s.DefaultFormats
+        val coveringJsValue = geo \ "columns" \ "geometry" \ "covering"
+        val covering = coveringJsValue.extract[Covering]
+        assert(covering.bbox.xmin == Seq("test_cov", "xmin"))
+        assert(covering.bbox.ymin == Seq("test_cov", "ymin"))
+        assert(covering.bbox.xmax == Seq("test_cov", "xmax"))
+        assert(covering.bbox.ymax == Seq("test_cov", "ymax"))
+      }
+    }
+
+    it("GeoParquet auto populates covering metadata for a Box2D <geom>_bbox 
column") {
+      // Auto-detect path: when a column named <geom>_bbox is a Box2D, reuse 
it as the
+      // covering column instead of synthesizing a separate float64 struct.
+      val df = sparkSession
+        .range(0, 100)
+        .toDF("id")
+        .withColumn("id", expr("CAST(id AS DOUBLE)"))
+        .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+        .withColumn("geometry_bbox", expr("ST_Box2D(geometry)"))
+      val geoParquetSavePath = geoparquetoutputlocation + 
"/gp_box2d_auto_covering.parquet"
+      df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)

Review Comment:
   Same as the previous comment — sticking with the existing hard-coded 
`geoparquetoutputlocation` convention used by the other covering tests in this 
file. Refactoring to per-test unique dirs would be a separate suite-wide change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [GH-2886] Recognize Box2D columns as GeoParquet bbox covering columns [sedona]

Reply via email to