Re: [PR] [SPARK-56682][Geo][CONNECT][PYTHON][SQL] Extend the ST_AsBinary function to take optional endianness [spark]

via GitHub Mon, 04 May 2026 12:42:57 -0700


cloud-fan commented on code in PR #55631:
URL: https://github.com/apache/spark/pull/55631#discussion_r3184047829



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -666,6 +666,14 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase with ExecutionE
       summary = "")
   }
 
+  def stInvalidArgumentErrorInvalidEndiannessValue(
+      sqlFunction: String, endianness: String): SparkIllegalArgumentException 
= {
+    new SparkIllegalArgumentException(
+      errorClass = "ST_INVALID_ARGUMENT.INVALID_ENDIANNESS_VALUE",

Review Comment:
   **Critical: this error class is not defined.** 
`ST_INVALID_ARGUMENT.INVALID_ENDIANNESS_VALUE` has no entry in 
`common/utils/src/main/resources/error/error-conditions.json`. The result is 
visible in `st-functions.sql.out` — every invalid-endianness case throws 
`INTERNAL_ERROR` with `"Cannot find main error class 
'ST_INVALID_ARGUMENT.INVALID_ENDIANNESS_VALUE'"` as the captured output. CI is 
green only because the golden file matches the broken behavior. Add the entry, 
regenerate the golden file, and verify the new output is a real `ST_INVALID_…` 
error.
   
   Also note: every other ST error class (`ST_INVALID_SRID_VALUE`, 
`ST_INVALID_CRS_VALUE`, `ST_INVALID_ALGORITHM_VALUE`) is flat — using a 
subClass form here breaks that convention. Consider naming this 
`ST_INVALID_ENDIANNESS_VALUE` for consistency.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/Geography.java:
##########
@@ -130,8 +130,14 @@ public static Geography fromEwkt(byte[] ewkt) {
   /** Geography binary standard format converters: WKB and EWKB. */
 
   @Override
-  public byte[] toWkb() {
-    return toWkbInternal(DEFAULT_ENDIANNESS);
+  public byte[] toWkb(String endianness) {
+    // Check the endianness value.
+    if (!endianness.equalsIgnoreCase("NDR") && 
!endianness.equalsIgnoreCase("XDR")) {
+      throw QueryExecutionErrors.stInvalidArgumentErrorInvalidEndiannessValue(
+          "st_asbinary", endianness);

Review Comment:
   Layer concern: hardcoding `"st_asbinary"` as the SQL function name inside 
`Geography.toWkb(String)` couples this low-level Geo utility to a specific SQL 
function. Any future caller of `toWkb(String)` (a different SQL function, 
library code, an internal caller) would mislabel its errors. Validation should 
live at the expression layer (e.g., in `ST_AsBinary` via `checkInputDataTypes` 
for foldable cases, or in a shared `STUtils.parseEndianness` helper invoked 
from `STUtils.stAsBinary`). The same hardcoded name appears in 
`Geometry.java:137`.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala:
##########
@@ -666,6 +666,14 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase with ExecutionE
       summary = "")
   }
 
+  def stInvalidArgumentErrorInvalidEndiannessValue(
+      sqlFunction: String, endianness: String): SparkIllegalArgumentException 
= {
+    new SparkIllegalArgumentException(
+      errorClass = "ST_INVALID_ARGUMENT.INVALID_ENDIANNESS_VALUE",
+      messageParameters = Map("sqlFunction" -> sqlFunction, "e" -> endianness)

Review Comment:
   Use a descriptive parameter name instead of `e` — `endianness` matches the 
function arg and reads better in the error message template.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/Geometry.java:
##########
@@ -130,8 +130,14 @@ public static Geometry fromEwkt(byte[] ewkt) {
   /** Geometry binary standard format converters: WKB and EWKB. */
 
   @Override
-  public byte[] toWkb() {
-    return toWkbInternal(DEFAULT_ENDIANNESS);
+  public byte[] toWkb(String endianness) {
+    // Check the endianness value.
+    if (!endianness.equalsIgnoreCase("NDR") && 
!endianness.equalsIgnoreCase("XDR")) {
+      throw QueryExecutionErrors.stInvalidArgumentErrorInvalidEndiannessValue(
+          "st_asbinary", endianness);

Review Comment:
   Same as the comment on `Geography.java:137` — the validation block here is 
identical. After moving validation up to a shared helper, both 
`Geography.toWkb(String)` and `Geometry.toWkb(String)` can collapse to a single 
delegation to `toWkb(ByteOrder)`.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/STUtils.java:
##########
@@ -78,6 +82,16 @@ public static GeometryVal geographyToGeometry(GeographyVal 
geographyVal) {
     return toPhysVal(Geometry.fromBytes(geographyVal.getBytes()));
   }
 
+  // Cast geography to binary.
+  public static byte[] geographyToBinary(GeographyVal geographyVal) {

Review Comment:
   `geographyToBinary` and `geometryToBinary` (lines 86 and 91) have 
byte-for-byte the same body as the existing `stAsBinary(GeographyVal)` / 
`stAsBinary(GeometryVal)` 1-arg overloads (lines 130 and 138). Two ways to 
spell the same operation is more confusing than helpful — drop one set. The 
simplest path: keep the `stAsBinary` 1-arg overloads and have 
`ParquetWriteSupport` call those instead.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/STUtils.java:
##########
@@ -63,7 +67,7 @@ public static GeographyVal geometryToGeography(GeometryVal 
geometryVal) {
     }
     // We also need to check whether the input geometry has coordinates in 
geography bounds.
     try {
-      byte[] wkb = stAsBinary(geometryVal);
+      byte[] wkb = stAsBinary(geometryVal, ENDIANNESS_NDR);

Review Comment:
   These internal callers (also lines 115, 123) call `stAsBinary(geo, 
ENDIANNESS_NDR)` for known-NDR uses, which forces `UTF8String → String → 
equalsIgnoreCase("NDR") / equalsIgnoreCase("XDR") → ByteOrder` for every 
invocation — pure overhead. Call `geographyToBinary` / `geometryToBinary` (or 
the 1-arg `stAsBinary` overloads) directly, like `ParquetWriteSupport.scala` 
already does.



##########
sql/core/src/test/scala/org/apache/spark/sql/STExpressionsSuite.scala:
##########
@@ -476,16 +476,21 @@ class STExpressionsSuite
 
   test("ST_AsBinary") {
     // Test data: WKB representation of POINT(1 2).
-    val wkb = 
Hex.unhex("0101000000000000000000F03F0000000000000040".getBytes())
-    val wkbLiteral = Literal.create(wkb, BinaryType)
+    val wkbNdr = 
Hex.unhex("0101000000000000000000F03F0000000000000040".getBytes())
+    val wkbXdr = 
Hex.unhex("00000000013FF00000000000004000000000000000".getBytes())
+    val wkbLiteral = Literal.create(wkbNdr, BinaryType)
     // ST_GeogFromWKB and ST_AsBinary.
     val geographyExpression = ST_GeogFromWKB(wkbLiteral)
     assert(geographyExpression.dataType.sameType(defaultGeographyType))
-    checkEvaluation(ST_AsBinary(geographyExpression), wkb)
+    checkEvaluation(new ST_AsBinary(geographyExpression), wkbNdr)
+    checkEvaluation(ST_AsBinary(geographyExpression, Literal.create("NDR")), 
wkbNdr)
+    checkEvaluation(ST_AsBinary(geographyExpression, Literal.create("XDR")), 
wkbXdr)
     // ST_GeomFromWKB and ST_AsBinary.
     val geometryExpression = new ST_GeomFromWKB(wkbLiteral)
     assert(geometryExpression.dataType.sameType(defaultGeometryType))
-    checkEvaluation(ST_AsBinary(geometryExpression), wkb)
+    checkEvaluation(new ST_AsBinary(geometryExpression), wkbNdr)

Review Comment:
   Two test cases would close real coverage gaps: (1) mixed-case endianness — 
the validation uses `equalsIgnoreCase`, so `'ndr'`, `'xdr'`, `'Ndr'` are 
accepted, but no test exercises that path; (2) NULL endianness — 
`ST_AsBinary(geo, NULL)` should propagate to NULL but isn't covered. Worth 
adding both alongside the existing `'NDR'`/`'XDR'` cases.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/st/stExpressions.scala:
##########
@@ -51,55 +52,69 @@ sealed trait GeospatialInputTypes extends 
ImplicitCastInputTypes {
 private[sql] object ExpressionDefaults {
   val DEFAULT_GEOGRAPHY_SRID: Int = Geography.DEFAULT_SRID
   val DEFAULT_GEOMETRY_SRID: Int = Geometry.DEFAULT_SRID
+  val DEFAULT_WKB_ENDIANNESS: String = "NDR"

Review Comment:
   This duplicates `STUtils.NDR` (line 32 of `STUtils.java`). Reference 
`STUtils.NDR` here instead of defining a parallel `DEFAULT_WKB_ENDIANNESS`.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/st/stExpressions.scala:
##########
@@ -51,55 +52,69 @@ sealed trait GeospatialInputTypes extends 
ImplicitCastInputTypes {
 private[sql] object ExpressionDefaults {
   val DEFAULT_GEOGRAPHY_SRID: Int = Geography.DEFAULT_SRID
   val DEFAULT_GEOMETRY_SRID: Int = Geometry.DEFAULT_SRID
+  val DEFAULT_WKB_ENDIANNESS: String = "NDR"
 }
 
 /** ST writer expressions. */
 
 /**
- * Returns the input GEOGRAPHY or GEOMETRY value in WKB format.
+ * Returns the input GEOGRAPHY or GEOMETRY value in WKB format using the 
specified endianness, if
+ * provided. If no endianness is provided, it defaults to little endian.
  * See 
https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
  * for more details on the WKB format.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(geo) - Returns the geospatial value (value of type GEOGRAPHY 
or GEOMETRY) "
-    + "in WKB format.",
+  usage = "_FUNC_(geo, endianness) - Returns the geospatial value (value of 
type GEOGRAPHY or "

Review Comment:
   Usage signature should mark `endianness` as optional, matching the 
convention used by peers like `ST_GeomFromWKB` (`_FUNC_(wkb[, srid])`).
   
   ```suggestion
     usage = "_FUNC_(geo[, endianness]) - Returns the geospatial value (value 
of type GEOGRAPHY or "
   ```



##########
sql/core/src/test/resources/sql-tests/results/st-functions.sql.out:
##########
@@ -768,6 +768,200 @@ org.apache.spark.SparkIllegalArgumentException
 }
 
 
+-- !query
+SELECT ST_AsBinary(NULL)
+-- !query schema
+struct<st_asbinary(NULL, NDR):binary>
+-- !query output
+NULL
+
+
+-- !query
+SELECT 
hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')))
+-- !query schema
+struct<hex(st_asbinary(st_geogfromwkb(X'0101000000000000000000F03F0000000000000040'),
 NDR)):string>
+-- !query output
+0101000000000000000000F03F0000000000000040
+
+
+-- !query
+SELECT 
hex(ST_AsBinary(ST_GeogFromWKB(X'00000000013FF00000000000004000000000000000'), 
'NDR'))
+-- !query schema
+struct<hex(st_asbinary(st_geogfromwkb(X'00000000013FF00000000000004000000000000000'),
 NDR)):string>
+-- !query output
+0101000000000000000000F03F0000000000000040
+
+
+-- !query
+SELECT 
hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'), 
'XDR'))
+-- !query schema
+struct<hex(st_asbinary(st_geogfromwkb(X'0101000000000000000000F03F0000000000000040'),
 XDR)):string>
+-- !query output
+00000000013FF00000000000004000000000000000
+
+
+-- !query
+SELECT 
hex(ST_AsBinary(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040')))
+-- !query schema
+struct<hex(st_asbinary(st_geomfromwkb(X'0101000000000000000000F03F0000000000000040',
 0), NDR)):string>
+-- !query output
+0101000000000000000000F03F0000000000000040
+
+
+-- !query
+SELECT 
hex(ST_AsBinary(ST_GeomFromWKB(X'00000000013FF00000000000004000000000000000'), 
'NDR'))
+-- !query schema
+struct<hex(st_asbinary(st_geomfromwkb(X'00000000013FF00000000000004000000000000000',
 0), NDR)):string>

Review Comment:
   These `INTERNAL_ERROR` outputs (and the matching ones below for `'ABC'`, 
`'big-endian'`, and the table-level cases) reflect the missing error class 
flagged on `QueryExecutionErrors.scala`. After defining the 
`ST_INVALID_…ENDIANNESS_VALUE` entry in `error-conditions.json` and 
regenerating, these blocks should show the real ST error — not 
`INTERNAL_ERROR`. Worth verifying carefully when the golden file is regenerated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56682][Geo][CONNECT][PYTHON][SQL] Extend the ST_AsBinary function to take optional endianness [spark]

Reply via email to