[ 
https://issues.apache.org/jira/browse/SPARK-56819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080044#comment-18080044
 ] 

XiaodongHuan commented on SPARK-56819:
--------------------------------------

I would like to work on this issue.

The initial idea is to add an opt-in SQL configuration for MySQL-compatible 
CHAR retrieval behavior. When the new configuration is enabled, Spark would 
trim trailing spaces from CHAR(N) columns/fields on the read path. The default 
value should be false, so the current Spark behavior remains unchanged.
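
To make the intended effect concrete, here is a minimal sketch in plain Python 
(not Spark code) of how a CHAR(5) value would look with and without the 
proposed trimming. The helper names pad_char and trim_char are hypothetical and 
exist only for illustration.

```python
def pad_char(value: str, n: int) -> str:
    """Model fixed-length CHAR(N): right-pad with spaces up to length n."""
    return value.ljust(n)


def trim_char(value: str) -> str:
    """Model the proposed read path: remove trailing spaces only."""
    return value.rstrip(" ")


stored = pad_char("ab", 5)       # what a CHAR(5) column holds today
trimmed = trim_char(stored)      # what the opt-in read path would return

print(len(stored), len(trimmed))  # 5 2
print(repr(stored + "|"))         # 'ab   |'
print(repr(trimmed + "|"))        # 'ab|'
```

Leading spaces are left untouched in this sketch, since only trailing spaces 
are padding.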

This option would only affect CHAR types when reading table data. It should not 
change VARCHAR/STRING semantics, nor the existing write-side CHAR/VARCHAR 
length checks.

One open question is how this option should interact with 
spark.sql.readSideCharPadding. My current thought is that the new trim-on-read 
behavior should take precedence when explicitly enabled, since applying both 
read-side padding and read-side trimming would be confusing.
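
The precedence I have in mind could be sketched roughly as follows, again in 
plain Python rather than Spark code. The function read_char and the flag 
trim_on_read are hypothetical names (the real configuration name is still to be 
decided); read_side_padding stands in for spark.sql.readSideCharPadding, and 
its default of True here is an assumption of the sketch.

```python
def read_char(
    stored: str,
    n: int,
    *,
    trim_on_read: bool = False,      # hypothetical new option, off by default
    read_side_padding: bool = True,  # stands in for spark.sql.readSideCharPadding
) -> str:
    """Resolve the value returned for a CHAR(n) column on the read path."""
    if trim_on_read:
        # Explicit opt-in wins: never pad and trim the same value.
        return stored.rstrip(" ")
    if read_side_padding:
        # Existing behavior: make sure the value is padded to length n.
        return stored.ljust(n)
    # Neither option applies: return the stored value unchanged.
    return stored


# Both options enabled: trimming takes precedence over padding.
print(repr(read_char("ab   ", 5, trim_on_read=True, read_side_padding=True)))
# Defaults: current behavior, padded to the declared length.
print(repr(read_char("ab", 5)))
```

With this ordering, leaving the new option at its default keeps today's results 
regardless of how read-side padding is configured.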

Please let me know whether this approach and the configuration semantics sound 
reasonable. I will submit a PR once the expected behavior and the configuration 
name are agreed on.

> Add an option to trim trailing spaces when reading CHAR columns
> ---------------------------------------------------------------
>
>                 Key: SPARK-56819
>                 URL: https://issues.apache.org/jira/browse/SPARK-56819
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.1, 4.1.1
>         Environment: spark-4.0.1
>            Reporter: XiaodongHuan
>            Priority: Major
>
> Spark currently enforces CHAR(N) fixed-length semantics by padding CHAR 
> values on write, and by applying read-side padding when 
> spark.sql.readSideCharPadding is enabled. This behavior is different from 
> MySQL, where CHAR values normally have trailing spaces removed on retrieval 
> unless PAD_CHAR_TO_FULL_LENGTH is enabled.
>
> This difference makes MySQL-to-Spark migration harder for workloads that rely 
> on MySQL's default CHAR retrieval behavior. Users may observe different 
> results for functions such as length(), concat(), comparisons in application 
> code, or downstream BI/reporting queries, unless they manually wrap CHAR 
> columns with rtrim() in every query.
>
> This proposal is to add an opt-in SQL configuration that trims trailing 
> spaces from CHAR(N) columns/fields when reading table data. The default 
> should preserve the current Spark behavior for compatibility. The new option 
> should only affect CHAR types on the read path, and should not change 
> VARCHAR/STRING semantics or write-side CHAR/VARCHAR length checks.
>
> The interaction with the existing spark.sql.readSideCharPadding option should 
> be clearly defined, so users can choose between Spark's fixed-length CHAR 
> behavior and MySQL-compatible CHAR retrieval behavior.


