llphxd opened a new pull request, #55820:
URL: https://github.com/apache/spark/pull/55820
### What changes were proposed in this pull request?
This PR adds a new SQL configuration,
`spark.sql.charTrimTrailingSpacesOnRead`, to trim trailing spaces from
`CHAR(N)` columns and fields when reading table data.
The new configuration is disabled by default, so the existing Spark behavior
is preserved. When it is enabled, it takes precedence over
`spark.sql.readSideCharPadding`.
This is intended to provide an opt-in compatibility mode for systems such as
MySQL, where `CHAR` values are commonly returned without trailing spaces unless
`PAD_CHAR_TO_FULL_LENGTH` is enabled.
### Why are the changes needed?
Spark currently enforces fixed-length `CHAR(N)` semantics by padding `CHAR`
values on write, and by applying read-side padding when
`spark.sql.readSideCharPadding` is enabled.
I tested this behavior across several Spark versions with MySQL tables. In
Spark 3.3.1 and Spark 3.4.4, MySQL `CHAR` and `VARCHAR` columns were simply
treated as Spark `STRING`, so values came back as MySQL returned them, with no
read-side padding. In Spark 3.5.2 and Spark 4.0.1, Spark maps MySQL
character types to the stricter, more standard Spark `CHAR` types, so `CHAR`
columns can behave differently than they did in older Spark versions.
This makes migration or upgrade harder for workloads that rely on the
previous string-like behavior or on MySQL's default `CHAR` retrieval behavior,
where trailing spaces are removed on read. Users may otherwise need to wrap
many `CHAR` columns with `rtrim()` manually in queries.
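To illustrate the manual workaround described above, here is a minimal Python sketch that builds a projection wrapping each `CHAR` column in `rtrim()`. The helper name `rtrim_projection` and the `(name, type)` schema list are hypothetical, and the column names are borrowed from the example later in this description; this is not part of the PR.

```python
def rtrim_projection(columns):
    """Build a SELECT list that wraps CHAR(N) columns in rtrim().

    `columns` is a list of (name, sql_type) pairs. Hypothetical helper,
    shown only to illustrate the per-column workaround.
    """
    parts = []
    for name, sql_type in columns:
        # Only fixed-length CHAR(N) needs trimming; VARCHAR and STRING pass through.
        if sql_type.upper().startswith("CHAR("):
            parts.append(f"rtrim({name}) AS {name}")
        else:
            parts.append(name)
    return ", ".join(parts)


schema = [("c", "CHAR(4)"), ("v", "VARCHAR(4)"), ("s", "STRING")]
query = f"SELECT {rtrim_projection(schema)} FROM t"
print(query)
```

Every query that reads such a table would need a projection like this, which is the repetition the new configuration is meant to avoid.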
This PR provides an opt-in configuration to make this behavior easier to
control without changing Spark's default semantics.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds a new SQL configuration:
```text
spark.sql.charTrimTrailingSpacesOnRead
```
The default value is `false`, so existing behavior is unchanged.
When set to `true`, Spark trims trailing spaces from `CHAR(N)` columns and
fields when reading table data. The option does not affect `VARCHAR` or `STRING`,
and it does not change write-side `CHAR`/`VARCHAR` length checks.
Example:
```sql
SET spark.sql.charTrimTrailingSpacesOnRead=true;
CREATE TABLE t (c CHAR(4), v VARCHAR(4), s STRING) USING parquet;
INSERT INTO t VALUES ('12', '12 ', '12 ');
SELECT c, length(c), v, length(v), s, length(s) FROM t;
```
With the new configuration enabled, the `CHAR(4)` value is returned without
trailing spaces, while `VARCHAR` and `STRING` remain unchanged.
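The expected read-side result for the `CHAR(4)` column can be modeled in plain Python. This is an illustrative sketch of the semantics described in this PR, not Spark's implementation; the function `read_char` and its parameter names are assumptions made for the example.

```python
def read_char(stored, n, trim_on_read=False, read_side_padding=True):
    """Model the value a reader sees for a CHAR(n) column.

    Illustrative only: the function and its parameters are hypothetical.
    """
    if trim_on_read:
        # Per the PR, the new option takes precedence over read-side padding.
        return stored.rstrip(" ")
    if read_side_padding:
        # Pad with spaces up to the declared length.
        return stored.ljust(n)
    return stored


# CHAR(4) values are padded on write, so '12' is stored as '12  '.
stored_c = "12".ljust(4)

for trim in (False, True):
    c = read_char(stored_c, 4, trim_on_read=trim)
    print(trim, repr(c), len(c))
```

Under this model the `CHAR(4)` value has length 4 with the option off and length 2 with it on, while the `VARCHAR` and `STRING` values (`'12 '`, length 3) are not touched by the option.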
### How was this patch tested?
Added test coverage in CharVarcharTestSuite for trimming trailing spaces
from CHAR columns and nested CHAR fields on read, while keeping VARCHAR and
STRING unchanged.
Tested with:
./dev/scalastyle
build/sbt "sql/testOnly *CharVarcharTestSuite"
### Was this patch authored or co-authored using generative AI tooling?
Assisted by ChatGPT-5.5
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]