llphxd opened a new pull request, #55820:
URL: https://github.com/apache/spark/pull/55820

   
   
   ### What changes were proposed in this pull request?
   This PR adds a new SQL configuration, 
`spark.sql.charTrimTrailingSpacesOnRead`, to trim trailing spaces from 
`CHAR(N)` columns and fields when reading table data.
   The new configuration is disabled by default, so the existing Spark behavior 
is preserved. When it is enabled, it takes precedence over 
`spark.sql.readSideCharPadding`.
   This is intended to provide an opt-in compatibility mode for systems such as 
MySQL, where `CHAR` values are commonly returned without trailing spaces unless 
`PAD_CHAR_TO_FULL_LENGTH` is enabled.
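
   The precedence described above can be sketched as a small pure-Python model. 
This is an illustrative assumption, not Spark's implementation: 
`read_char_value` and its parameters are hypothetical names, with 
`trim_on_read` standing in for `spark.sql.charTrimTrailingSpacesOnRead` and 
`read_side_padding` standing in for `spark.sql.readSideCharPadding`.

```python
def read_char_value(stored: str, n: int, trim_on_read: bool,
                    read_side_padding: bool) -> str:
    """Hypothetical model of how a CHAR(n) value is returned on read."""
    if trim_on_read:
        # The new option wins over read-side padding when both are enabled.
        return stored.rstrip(" ")
    if read_side_padding:
        # Pad short values with spaces up to the declared length n.
        return stored.ljust(n, " ")
    # Neither option applies: return the value exactly as stored.
    return stored
```

   In this model, enabling both options still yields the trimmed value, which 
mirrors the statement that the new configuration takes precedence.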
   
   
   ### Why are the changes needed?
   Spark currently enforces fixed-length `CHAR(N)` semantics by padding `CHAR` 
values on write, and by applying read-side padding when 
`spark.sql.readSideCharPadding` is enabled.
   I tested this behavior across several Spark versions with MySQL tables. In 
Spark 3.3.1 and Spark 3.4.4, MySQL `CHAR` and `VARCHAR` columns were simply 
treated as Spark `STRING`, so values kept string semantics and Spark did not 
pad them to a fixed length. In Spark 3.5.2 and Spark 4.0.1, Spark maps MySQL 
character types to Spark's stricter, more standard `CHAR` types, which can 
expose behavior differences for `CHAR` columns compared with older Spark 
versions.
   This makes migration or upgrade harder for workloads that rely on the 
previous string-like behavior or on MySQL's default `CHAR` retrieval behavior, 
where trailing spaces are removed on read. Users may otherwise need to wrap 
many `CHAR` columns with `rtrim()` manually in queries.
   This PR provides an opt-in configuration to make this behavior easier to 
control without changing Spark's default semantics.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   This PR adds a new SQL configuration:
   ```text
   spark.sql.charTrimTrailingSpacesOnRead
   ```
   The default value is `false`, so existing behavior is unchanged.
   
   When set to `true`, Spark trims trailing spaces from `CHAR(N)` columns and 
fields when reading table data. The option does not affect `VARCHAR` or 
`STRING`, and it does not change write-side `CHAR`/`VARCHAR` length checks.
   
   Example:
   ```sql
   SET spark.sql.charTrimTrailingSpacesOnRead=true;
   
   CREATE TABLE t (c CHAR(4), v VARCHAR(4), s STRING) USING parquet;
   INSERT INTO t VALUES ('12', '12 ', '12 ');
   
   SELECT c, length(c), v, length(v), s, length(s) FROM t;
   ```
   
   With the new configuration enabled, the `CHAR(4)` value is returned without 
trailing spaces, while `VARCHAR` and `STRING` remain unchanged.
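
   The effect on the example row can be traced with a small Python model. It 
is a sketch under assumptions rather than Spark code: `store` and `read` are 
hypothetical helpers, `CHAR(4)` is assumed to be padded to four characters on 
write, and write-side length checks are omitted.

```python
def store(value: str, sql_type: str, n: int = 0) -> str:
    """Hypothetical write path: only CHAR(n) is padded to its declared length."""
    return value.ljust(n, " ") if sql_type == "CHAR" else value


def read(stored: str, sql_type: str, trim_on_read: bool) -> str:
    """Hypothetical read path: the option trims CHAR only, never VARCHAR or STRING."""
    if sql_type == "CHAR" and trim_on_read:
        return stored.rstrip(" ")
    return stored


# Row from the example: INSERT INTO t VALUES ('12', '12 ', '12 ')
row = [("c", "CHAR", 4, "12"), ("v", "VARCHAR", 4, "12 "), ("s", "STRING", 0, "12 ")]

for trim_on_read in (False, True):
    result = {name: read(store(value, sql_type, n), sql_type, trim_on_read)
              for name, sql_type, n, value in row}
    lengths = {name: len(value) for name, value in result.items()}
    print(trim_on_read, lengths)
```

   Under these assumptions, the modeled `length(c)` drops from 4 to 2 when the 
option is enabled, while `length(v)` and `length(s)` stay at 3.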
   
   
   
   ### How was this patch tested?
   Added test coverage in `CharVarcharTestSuite` for trimming trailing spaces 
from `CHAR` columns and nested `CHAR` fields on read, while keeping `VARCHAR` 
and `STRING` unchanged.
   
   Tested with:
   ./dev/scalastyle
   build/sbt "sql/testOnly *CharVarcharTestSuite"
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Assisted by ChatGPT-5.5
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

