jiayuasu commented on code in PR #2921:
URL: https://github.com/apache/sedona/pull/2921#discussion_r3206296722
##########
spark/common/src/main/scala/org/apache/spark/sql/execution/datasources/geoparquet/GeoParquetMetaData.scala:
##########
@@ -236,6 +237,14 @@ object GeoParquetMetaData {
schema(coveringColumnIndex).dataType match {
case coveringColumnType: StructType =>
coveringColumnTypeToCovering(coveringColumnName, coveringColumnType)
+ case _: Box2DUDT =>
+ // Box2DUDT exposes a struct<xmin, ymin, xmax, ymax: double> sqlType,
which is the exact
+ // shape required by GeoParquet 1.1 bbox covering columns. Treat the
underlying struct as
+ // the covering struct so users can write a Box2D column and have it
referenced as a
+ // covering column in GeoParquet metadata without any manual struct
construction.
+ coveringColumnTypeToCovering(
+ coveringColumnName,
+ new Box2DUDT().sqlType.asInstanceOf[StructType])
case _ =>
Review Comment:
Done in 776c5b93 — bound the case to `udt: Box2DUDT` and use `udt.sqlType`
with a `match` on `StructType` (no allocation, no `asInstanceOf`, clear failure
mode if the sqlType shape ever changes).
On the second suggestion (generalize to any `UserDefinedType` whose
`sqlType` is a `StructType`): I would push back on that. Other UDTs may have
struct-shaped `sqlType`s that are not valid bbox covering columns — e.g., a
future ImageUDT or RasterMetadataUDT could have a struct of metadata fields,
and the call to `validateField("xmin")` inside `coveringColumnTypeToCovering`
would throw a confusing error instead of the clear "not a struct type" one.
Keeping the match specific to `Box2DUDT` makes the contract explicit; we can
add other UDTs case-by-case as they appear.
##########
spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala:
##########
@@ -1021,6 +1021,53 @@ class geoparquetIOTests extends TestBaseScala with
BeforeAndAfterAll {
}
}
+ it("GeoParquet supports writing covering metadata from a Box2D column") {
+ // User-provided Box2D column referenced via the geoparquet.covering
option.
+ val df = sparkSession
+ .range(0, 100)
+ .toDF("id")
+ .withColumn("id", expr("CAST(id AS DOUBLE)"))
+ .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+ .withColumn("test_cov", expr("ST_Box2D(geometry)"))
+ val geoParquetSavePath = geoparquetoutputlocation +
"/gp_with_box2d_covering.parquet"
+ df.write
+ .format("geoparquet")
+ .option("geoparquet.covering.geometry", "test_cov")
+ .mode("overwrite")
+ .save(geoParquetSavePath)
Review Comment:
The hard-coded path under `geoparquetoutputlocation` is the existing
convention for the GeoParquet covering tests in this file (line 992 onward,
`gp_with_covering_metadata.parquet` etc.). Using a different pattern just for
the new tests would be inconsistent. If we want to move the suite to per-test
unique dirs, that should be a focused refactor across all of these tests rather
than a one-off here.
##########
spark/common/src/test/scala/org/apache/sedona/sql/geoparquetIOTests.scala:
##########
@@ -1021,6 +1021,53 @@ class geoparquetIOTests extends TestBaseScala with
BeforeAndAfterAll {
}
}
+ it("GeoParquet supports writing covering metadata from a Box2D column") {
+ // User-provided Box2D column referenced via the geoparquet.covering
option.
+ val df = sparkSession
+ .range(0, 100)
+ .toDF("id")
+ .withColumn("id", expr("CAST(id AS DOUBLE)"))
+ .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+ .withColumn("test_cov", expr("ST_Box2D(geometry)"))
+ val geoParquetSavePath = geoparquetoutputlocation +
"/gp_with_box2d_covering.parquet"
+ df.write
+ .format("geoparquet")
+ .option("geoparquet.covering.geometry", "test_cov")
+ .mode("overwrite")
+ .save(geoParquetSavePath)
+ validateGeoParquetMetadata(geoParquetSavePath) { geo =>
+ implicit val formats: org.json4s.Formats = org.json4s.DefaultFormats
+ val coveringJsValue = geo \ "columns" \ "geometry" \ "covering"
+ val covering = coveringJsValue.extract[Covering]
+ assert(covering.bbox.xmin == Seq("test_cov", "xmin"))
+ assert(covering.bbox.ymin == Seq("test_cov", "ymin"))
+ assert(covering.bbox.xmax == Seq("test_cov", "xmax"))
+ assert(covering.bbox.ymax == Seq("test_cov", "ymax"))
+ }
+ }
+
+ it("GeoParquet auto populates covering metadata for a Box2D <geom>_bbox
column") {
+ // Auto-detect path: when a column named <geom>_bbox is a Box2D, reuse
it as the
+ // covering column instead of synthesizing a separate float64 struct.
+ val df = sparkSession
+ .range(0, 100)
+ .toDF("id")
+ .withColumn("id", expr("CAST(id AS DOUBLE)"))
+ .withColumn("geometry", expr("ST_Point(id, id + 1)"))
+ .withColumn("geometry_bbox", expr("ST_Box2D(geometry)"))
+ val geoParquetSavePath = geoparquetoutputlocation +
"/gp_box2d_auto_covering.parquet"
+ df.write.format("geoparquet").mode("overwrite").save(geoParquetSavePath)
Review Comment:
Same as the previous comment — sticking with the existing hard-coded
`geoparquetoutputlocation` convention used by the other covering tests in this
file. Refactoring to per-test unique dirs would be a separate suite-wide change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]