qlong opened a new pull request, #53458:
URL: https://github.com/apache/spark/pull/53458

   
   ### What changes were proposed in this pull request?
   
   This PR adds UTF-8 validation when casting from BinaryType to StringType to 
prevent data corruption. Spark's StringType is UTF-8 by design, and invalid 
UTF-8 violates the type system, causing irreversible data loss.
   
   Added the `spark.sql.castBinaryToString.validateUtf8` config (default: `true`):
   - When `true` (default): validates UTF-8 during the cast
     * ANSI mode: throws an exception on invalid UTF-8
     * LEGACY mode: returns NULL on invalid UTF-8
     * TRY mode: returns NULL on invalid UTF-8
   - When `false`: preserves the old behavior (not recommended)
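
   For illustration, a minimal sketch of the TRY-mode semantics described above (`TRY_CAST` is existing Spark SQL syntax; the NULL result assumes this PR's validation is in effect):

   ```scala
   // TRY mode: invalid UTF-8 yields NULL rather than an error,
   // even when ANSI mode is enabled.
   spark.conf.set("spark.sql.ansi.enabled", true)
   spark.sql("SELECT TRY_CAST(X'80' AS STRING) AS s").show()
   // +----+
   // |   s|
   // +----+
   // |NULL|
   // +----+
   ```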
   
   **References:**
   - **Snowflake**: `TO_VARCHAR` throws error on invalid UTF-8, 
`TRY_TO_VARCHAR` returns NULL
   - **PostgreSQL**: Rejects invalid UTF-8 by default, validates on input
   - **Redshift**: Throws error when casting invalid bytes to VARCHAR
   
   This PR brings Spark's behavior in line with those engines for UTF-8 
validation.
   
   
   
   ### Why are the changes needed?
   
   This PR fixes 
[SPARK-54586](https://issues.apache.org/jira/browse/SPARK-54586).
   
   **The fundamental problem:**
   ```scala
   // Before this PR: Invalid UTF-8 bytes are silently allowed in StringType
   val bytes = Array[Byte](0x80.toByte, 0xFF.toByte)  // Invalid UTF-8
   // Cast creates UTF8String containing raw bytes [0x80, 0xFF]
   // These raw invalid bytes violate UTF-8 spec but are stored anyway
   ```
   
   Spark's `StringType` is UTF-8 by definition. Invalid UTF-8 in StringType:
   1. **Violates the type system**: StringType should only contain valid UTF-8
   2. **Causes data loss**: When other tools read the data, they may replace the invalid bytes with U+FFFD (changing 2 bytes into 6 bytes) or fail (see the sketch below)
   3. **Breaks interoperability**: Parquet, ORC, and JSON all require valid UTF-8 for strings, so other tools may fail to read the data entirely
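
   As a plain-JVM illustration of point 2 (not Spark code; the standard Java UTF-8 decoder shows the same lossy replacement other tools may apply):

   ```scala
   import java.nio.charset.StandardCharsets

   val bytes = Array[Byte](0x80.toByte, 0xFF.toByte)        // invalid UTF-8
   // Java's UTF-8 decoder replaces each malformed byte with U+FFFD...
   val decoded = new String(bytes, StandardCharsets.UTF_8)   // "\uFFFD\uFFFD"
   // ...which re-encodes to 6 bytes; the original 2 bytes are unrecoverable.
   val reencoded = decoded.getBytes(StandardCharsets.UTF_8)  // length 6
   ```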
   
   **Current behavior (bug):**
   ```scala
   val df = Seq(Array[Byte](0x80.toByte, 0xFF.toByte)).toDF("binary")
   df.selectExpr("cast(binary as string) as str")
     .write.parquet("/tmp/output")
   // Result: Parquet file contains invalid UTF-8 bytes [0x80, 0xFF] in the STRING column
   // Other tools reading this file may:
   //   - Replace with U+FFFD replacement characters (data corruption)
   //   - Fail to read the file
   //   - Produce undefined behavior
   ```
   
   **With this fix:**
   ```scala
   // Default behavior (LEGACY mode)
   spark.sql("SELECT CAST(X'80' AS STRING)").show()  // NULL
   
   // ANSI mode
   spark.conf.set("spark.sql.ansi.enabled", true)
   spark.sql("SELECT CAST(X'80' AS STRING)")  // Throws SparkRuntimeException
   
   // Legacy behavior still available
   spark.conf.set("spark.sql.castBinaryToString.validateUtf8", false)
   spark.sql("SELECT CAST(X'80' AS STRING)").show()  // Old behavior
   ```
   
   
   ### Does this PR introduce _any_ user-facing change?
   **Yes.**
   
   With default config (`validateUtf8=true`), casting invalid UTF-8 to string 
now:
   - **LEGACY/TRY mode**: Returns NULL instead of corrupted string
   - **ANSI mode**: Throws `SparkRuntimeException` with error code 
`CAST_INVALID_INPUT`
   
   **Previous behavior (before this PR):**
   ```scala
   scala> spark.sql("SELECT CAST(X'80' AS STRING)").show()
   +---------------------+
   |CAST(X'80' AS STRING)|
   +---------------------+
   |                    �|  // Invalid UTF-8 byte (raw 0x80 stored, displays as �)
   +---------------------+
   ```
   
   **New behavior (after this PR, default config):**
   ```scala
   // LEGACY mode (default)
   scala> spark.sql("SELECT CAST(X'80' AS STRING)").show()
   +---------------------+
   |CAST(X'80' AS STRING)|
   +---------------------+
   |                 NULL|  // NULL instead of corrupted data
   +---------------------+
   
   // ANSI mode
   scala> spark.conf.set("spark.sql.ansi.enabled", true)
   scala> spark.sql("SELECT CAST(X'80' AS STRING)")
   org.apache.spark.SparkRuntimeException: [CAST_INVALID_INPUT] The value \x80 of the type "BINARY" cannot be cast to "STRING" because it is malformed.
   ```
   
   **Migration path:**
   - Most users should keep the new default behavior (it prevents data corruption)
   - Users who need the old behavior can set `spark.sql.castBinaryToString.validateUtf8=false`
   - Use `TRY_CAST(binary AS STRING)` for a safe conversion that returns NULL on failure (see the sketch below)
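
   A brief sketch of the two options (the table and column names `events` and `binary_col` are hypothetical):

   ```scala
   // Option 1 (recommended): keep validation on and use TRY_CAST where
   // NULL-on-failure is acceptable.
   spark.sql("SELECT TRY_CAST(binary_col AS STRING) AS s FROM events")

   // Option 2 (not recommended): restore the pre-PR unvalidated behavior.
   spark.conf.set("spark.sql.castBinaryToString.validateUtf8", false)
   ```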
   
   
   ### How was this patch tested?
   
   **Verified interactively with spark-shell.**

   **Added new tests covering both default behavior and legacy mode.**
   
   **Test updates for existing tests:**
   
   Several existing tests use invalid UTF-8 and required updates:
   
   - **`hll.sql`, `thetasketch.sql`**: Added `SET 
spark.sql.castBinaryToString.validateUtf8=false` at the beginning of each file 
to preserve original collation test coverage with invalid UTF-8 casts.
   
   - **`listagg.sql`, `string-functions.sql`**: Tests both default 
(validateUtf8=true) and legacy (validateUtf8=false) behaviors using inline 
`SET` commands to toggle validation for specific test sections, providing 
comprehensive coverage of both modes.
   
   - **`StringFunctionsSuite.scala`**: Wrapped invalid UTF-8 test cases for `is_valid_utf8`, `make_valid_utf8`, `validate_utf8`, and `try_validate_utf8` with `withSQLConf(SQLConf.VALIDATE_BINARY_TO_STRING_CAST.key -> "false")`. This is necessary because these functions exercise their own invalid-UTF-8 handling, but with the new default the BinaryType → StringType cast validates UTF-8 first and throws (ANSI) or returns NULL (LEGACY) before the functions can execute. Disabling cast validation lets the functions receive invalid input and test their own validation logic (see the sketch after this list).
   
   - **`ThriftServerQueryTestSuite.scala`**: Added `needsInvalidUtf8` set 
containing `hll.sql` and `thetasketch.sql`. These files automatically get UTF-8 
validation disabled when run through ThriftServer.
   
   - **`ExpressionInfoSuite.scala`**: Added UTF-8 validation functions 
(`IsValidUTF8`, `MakeValidUTF8`, `ValidateUTF8`, `TryValidateUTF8`) to the 
`ignoreSet` for the "check-outputs-of-expression-examples" test. Their 
documentation examples use invalid UTF-8 (e.g., `x'80'`), which would fail with 
the new default before examples can execute. The functions are properly tested 
via unit tests and SQL tests with explicit config.
   
   - **`gen-sql-functions-docs.py`**: Sets 
`spark.sql.castBinaryToString.validateUtf8=false` to allow generation of 
documentation for UTF-8 validation functions with invalid UTF-8 examples.
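
   A minimal sketch of the `withSQLConf` wrapping described for `StringFunctionsSuite.scala` above (the query and expected row are illustrative, not copied from the PR):

   ```scala
   withSQLConf(SQLConf.VALIDATE_BINARY_TO_STRING_CAST.key -> "false") {
     // With cast validation disabled, the invalid byte reaches is_valid_utf8,
     // which can then report it as invalid.
     checkAnswer(
       sql("SELECT is_valid_utf8(CAST(X'80' AS STRING))"),
       Row(false))
   }
   ```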
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   

