This is an automated email from the ASF dual-hosted git repository.
maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 74c3757513c9 [MINOR][DOCS] Add a migration guide for encode/decode
unmappable characters
74c3757513c9 is described below
commit 74c3757513c9f580d060a88982463f3a8b1745b4
Author: Kent Yao <[email protected]>
AuthorDate: Wed Dec 4 14:00:11 2024 +0100
[MINOR][DOCS] Add a migration guide for encode/decode unmappable characters
### What changes were proposed in this pull request?
Add a migration guide for encode/decode unmappable characters
### Why are the changes needed?
Providing upgrading guides for users
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
passing doc build
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #49058 from yaooqinn/minor.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
---
docs/sql-migration-guide.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index ea4dbe926d14..717d27befef0 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -33,6 +33,7 @@ license: |
- Since Spark 4.0, `spark.sql.parquet.compression.codec` drops support for the
codec name `lz4raw`; please use `lz4_raw` instead.
- Since Spark 4.0, when a cast from timestamp to byte/short/int overflows
under non-ANSI mode, Spark returns null instead of a wrapped value.
- Since Spark 4.0, the `encode()` and `decode()` functions support only the
following charsets: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE',
'UTF-16', 'UTF-32'. To restore the previous behavior, in which the functions
accept any charset supported by the JDK running Spark, set
`spark.sql.legacy.javaCharsets` to `true`.
+- Since Spark 4.0, the `encode()` and `decode()` functions raise a
`MALFORMED_CHARACTER_CODING` error when handling unmappable characters,
whereas in Spark 3.5 and earlier versions such characters were replaced with
mojibake. To restore the previous behavior, set
`spark.sql.legacy.codingErrorAction` to `true`. For example, decoding the
string value `tést` / [116, -23, 115, 116] (encoded in Latin-1) with
'UTF-8' yields `t�st`.
- Since Spark 4.0, the legacy datetime rebasing SQL configs with the prefix
`spark.sql.legacy` are removed. To restore the previous behavior, use the
following configs:
- `spark.sql.parquet.int96RebaseModeInWrite` instead of
`spark.sql.legacy.parquet.int96RebaseModeInWrite`
- `spark.sql.parquet.datetimeRebaseModeInWrite` instead of
`spark.sql.legacy.parquet.datetimeRebaseModeInWrite`
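The strict-versus-replace distinction described in the new bullet can be illustrated with plain Python's codec error handlers. This is a sketch of the analogous behavior, not Spark code itself: strict decoding corresponds to the Spark 4.0 error, and replacement decoding corresponds to the pre-4.0 mojibake output restored by `spark.sql.legacy.codingErrorAction`.

```python
# The bytes [116, -23, 115, 116] are "tést" encoded in Latin-1;
# -23 is the signed-byte form of 0xE9.
data = bytes([116, 0xE9, 115, 116])

# Strict decoding fails on the unmappable byte, analogous to
# Spark 4.0 raising MALFORMED_CHARACTER_CODING:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# Replacement decoding substitutes U+FFFD for the invalid byte,
# analogous to the legacy behavior:
print(data.decode("utf-8", errors="replace"))  # prints "t�st"
```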
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]