qlong opened a new pull request, #53458:
URL: https://github.com/apache/spark/pull/53458
BREAKING CHANGE: With the default config, casting invalid UTF-8 binary values to
string now returns NULL (LEGACY/TRY) or throws (ANSI) instead of silently
corrupting data. Users who need the old behavior can set
`spark.sql.castBinaryToString.validateUtf8=false`.
### What changes were proposed in this pull request?
This PR adds UTF-8 validation when casting from BinaryType to StringType to
prevent data corruption. Spark's StringType is UTF-8 by design, and invalid
UTF-8 violates the type system, causing irreversible data loss.
Added `spark.sql.castBinaryToString.validateUtf8` config (default: `true`):
- When `true` (default): Validates UTF-8 during the cast
  * ANSI mode: Throws an exception on invalid UTF-8
  * LEGACY mode: Returns NULL on invalid UTF-8
  * TRY mode: Returns NULL on invalid UTF-8
- When `false`: Preserves the old behavior (not recommended)
**References:**
- **Snowflake**: `TO_VARCHAR` throws error on invalid UTF-8,
`TRY_TO_VARCHAR` returns NULL
- **PostgreSQL**: Rejects invalid UTF-8 by default, validates on input
- **Redshift**: Throws error when casting invalid bytes to VARCHAR
This PR brings Spark's behavior in line with those engines for UTF-8
validation.
### Why are the changes needed?
This PR fixes
[SPARK-54586](https://issues.apache.org/jira/browse/SPARK-54586).
**The fundamental problem:**
```scala
// Before this PR: Invalid UTF-8 bytes are silently allowed in StringType
val bytes = Array[Byte](0x80.toByte, 0xFF.toByte) // Invalid UTF-8
// Cast creates UTF8String containing raw bytes [0x80, 0xFF]
// These raw invalid bytes violate UTF-8 spec but are stored anyway
```
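For reference, here is a standalone check (plain JDK APIs, not part of this PR)
confirming that these bytes are rejected by a strict UTF-8 decoder; the snippet
is purely illustrative and can be run in any Scala REPL or spark-shell:
```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Strict decoder: report malformed input instead of silently replacing it
val decoder = StandardCharsets.UTF_8.newDecoder()
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT)

try {
  decoder.decode(ByteBuffer.wrap(Array[Byte](0x80.toByte, 0xFF.toByte)))
} catch {
  case e: CharacterCodingException =>
    println(s"invalid UTF-8: $e") // e.g. MalformedInputException: Input length = 1
}
```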
Spark's `StringType` is UTF-8 by definition. Invalid UTF-8 in StringType:
1. **Violates type system**: StringType should only contain valid UTF-8
2. **Causes data loss**: When other tools read the data, they may replace each
invalid byte with U+FFFD (turning the 2 bytes into 6 bytes; see the sketch after
this list) or fail outright
3. **Breaks interoperability**: Other tools may fail to read the data entirely,
since Parquet, ORC, and JSON all require valid UTF-8 for strings
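A minimal sketch of the size blow-up in point 2, assuming a reader that replaces
each invalid byte with U+FFFD (a common policy in other tools, not something this
PR does):
```scala
// U+FFFD is 3 bytes in UTF-8, so a one-for-one replacement of the
// 2 invalid bytes [0x80, 0xFF] yields 6 bytes.
val invalid = Array[Byte](0x80.toByte, 0xFF.toByte)
val replaced = "\uFFFD" * invalid.length
println(replaced.getBytes("UTF-8").length) // 6
```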
**Current behavior (bug):**
```scala
val df = Seq(Array[Byte](0x80.toByte, 0xFF.toByte)).toDF("binary")
df.selectExpr("cast(binary as string) as str")
.write.parquet("/tmp/output")
// Result: the Parquet file contains invalid UTF-8 bytes [0x80, 0xFF] in a STRING column
// Other tools reading this file may:
// - Replace with U+FFFD replacement characters (data corruption)
// - Fail to read the file
// - Produce undefined behavior
```
**With this fix:**
```scala
// Default behavior (LEGACY mode)
spark.sql("SELECT CAST(X'80' AS STRING)").show() // NULL
// ANSI mode
spark.conf.set("spark.sql.ansi.enabled", true)
spark.sql("SELECT CAST(X'80' AS STRING)") // Throws SparkRuntimeException
// Legacy behavior still available
spark.conf.set("spark.sql.castBinaryToString.validateUtf8", false)
spark.sql("SELECT CAST(X'80' AS STRING)").show() // Old behavior
```
### Does this PR introduce _any_ user-facing change?
**Yes.**
With default config (`validateUtf8=true`), casting invalid UTF-8 to string
now:
- **LEGACY/TRY mode**: Returns NULL instead of corrupted string
- **ANSI mode**: Throws `SparkRuntimeException` with error code
`CAST_INVALID_INPUT`
**Previous behavior (before this PR):**
```scala
scala> spark.sql("SELECT CAST(X'80' AS STRING)").show()
+---------------------+
|CAST(X'80' AS STRING)|
+---------------------+
|                    �| // Invalid UTF-8 byte: raw 0x80 stored, displays as �
+---------------------+
```
**New behavior (after this PR, default config):**
```scala
// LEGACY mode (default)
scala> spark.sql("SELECT CAST(X'80' AS STRING)").show()
+---------------------+
|CAST(X'80' AS STRING)|
+---------------------+
|                 NULL| // NULL instead of corrupted data
+---------------------+
// ANSI mode
scala> spark.conf.set("spark.sql.ansi.enabled", true)
scala> spark.sql("SELECT CAST(X'80' AS STRING)")
org.apache.spark.SparkRuntimeException: [CAST_INVALID_INPUT] The value \x80
of the type "BINARY" cannot be cast to "STRING" because it is malformed.
```
**Migration path:**
- Most users should keep the new default behavior (it prevents data corruption)
- Users who need the old behavior can set
`spark.sql.castBinaryToString.validateUtf8=false`
- Use `TRY_CAST(binary AS STRING)` for a safe conversion that returns NULL on
failure (see the example below)
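Illustrative example of the `TRY_CAST` path with the default config (validation
enabled), matching the TRY-mode behavior described above:
```scala
// TRY_CAST never throws; invalid UTF-8 input simply becomes NULL
spark.sql("SELECT TRY_CAST(X'80' AS STRING) AS s").show()
// +----+
// |   s|
// +----+
// |NULL|
// +----+
```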
### How was this patch tested?
**Verified interactively with spark-shell.**
**Added new tests covering both default behavior and legacy mode.**
**Test updates for existing tests:**
Several existing tests use invalid UTF-8 and required updates:
- **`hll.sql`, `thetasketch.sql`**: Added `SET
spark.sql.castBinaryToString.validateUtf8=false` at the beginning of each file
to preserve original collation test coverage with invalid UTF-8 casts.
- **`listagg.sql`, `string-functions.sql`**: Updated to exercise both the default
(validateUtf8=true) and legacy (validateUtf8=false) behaviors, using inline
`SET` commands to toggle validation for specific test sections and provide
coverage of both modes.
- **`StringFunctionsSuite.scala`**: Wrapped invalid UTF-8 test cases for
`is_valid_utf8`, `make_valid_utf8`, `validate_utf8`, and `try_validate_utf8`
with `withSQLConf(SQLConf.VALIDATE_BINARY_TO_STRING_CAST.key -> "false")`. This
is necessary because these functions test invalid UTF-8 validation logic, but
with the new default behavior, the BinaryType → StringType cast validates UTF-8
first and throws (ANSI) or returns NULL (LEGACY) before the functions can
execute. Disabling cast validation allows the functions to receive and test
their own validation logic (a sketch of this pattern follows this list).
- **`ThriftServerQueryTestSuite.scala`**: Added `needsInvalidUtf8` set
containing `hll.sql` and `thetasketch.sql`. These files automatically get UTF-8
validation disabled when run through ThriftServer.
- **`ExpressionInfoSuite.scala`**: Added UTF-8 validation functions
(`IsValidUTF8`, `MakeValidUTF8`, `ValidateUTF8`, `TryValidateUTF8`) to the
`ignoreSet` for the "check-outputs-of-expression-examples" test. Their
documentation examples use invalid UTF-8 (e.g., `x'80'`), which would fail with
the new default before examples can execute. The functions are properly tested
via unit tests and SQL tests with explicit config.
- **`gen-sql-functions-docs.py`**: Sets
`spark.sql.castBinaryToString.validateUtf8=false` to allow generation of
documentation for UTF-8 validation functions with invalid UTF-8 examples.
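As an illustration of the `StringFunctionsSuite.scala` wrapping pattern mentioned
above, here is a minimal sketch; the suite and test names are hypothetical, while
`SQLConf.VALIDATE_BINARY_TO_STRING_CAST` is the config constant this PR
introduces and `is_valid_utf8` is one of the affected functions:
```scala
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.test.SharedSparkSession

class BinaryToStringCastValidationSketch extends QueryTest with SharedSparkSession {
  test("is_valid_utf8 sees raw invalid bytes when cast validation is disabled") {
    // With cast validation off, the invalid byte 0x80 passes through the
    // BINARY -> STRING cast unchanged, so is_valid_utf8 can evaluate it itself.
    withSQLConf(SQLConf.VALIDATE_BINARY_TO_STRING_CAST.key -> "false") {
      checkAnswer(sql("SELECT is_valid_utf8(CAST(x'80' AS STRING))"), Row(false))
    }
  }
}
```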
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]