This is an automated email from the ASF dual-hosted git repository.
maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 74c3757513c9 [MINOR][DOCS] Add a migration guide for encode/decode
unmappable characters
74c3757513c9 is described below
commit 74c3757513c9f580d060a88982463f3a8b1745b4
Author: Kent Yao <[email protected]>
AuthorDate: Wed Dec 4 14:00:11 2024 +0100
[MINOR][DOCS] Add a migration guide for encode/decode unmappable characters
### What changes were proposed in this pull request?
Add a migration guide for encode/decode unmappable characters
### Why are the changes needed?
Providing upgrading guides for users
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
passing doc build
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #49058 from yaooqinn/minor.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
---
docs/sql-migration-guide.md | 1 +
1 file changed, 1 insertion(+)
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index ea4dbe926d14..717d27befef0 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -33,6 +33,7 @@ license: |
- Since Spark 4.0, `spark.sql.parquet.compression.codec` drops support for the
codec name `lz4raw`; please use `lz4_raw` instead.
- Since Spark 4.0, when a cast from timestamp to byte/short/int overflows
under non-ANSI mode, Spark returns null instead of a wrapped value.
- Since Spark 4.0, the `encode()` and `decode()` functions support only the
following charsets: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE',
'UTF-16', 'UTF-32'. To restore the previous behavior, in which the functions
accept any charset supported by the JDK running Spark, set
`spark.sql.legacy.javaCharsets` to `true`.
+- Since Spark 4.0, the `encode()` and `decode()` functions raise a
`MALFORMED_CHARACTER_CODING` error when handling unmappable characters,
whereas in Spark 3.5 and earlier versions such characters were replaced with
mojibake. To restore the previous behavior, set
`spark.sql.legacy.codingErrorAction` to `true`. For example, decoding the
string value `tést` / [116, -23, 115, 116] (encoded in Latin-1) with
'UTF-8' yields `t�st`.
- Since Spark 4.0, the legacy datetime rebasing SQL configs with the prefix
`spark.sql.legacy` are removed. To restore the previous behavior, use the
following configs:
- `spark.sql.parquet.int96RebaseModeInWrite` instead of
`spark.sql.legacy.parquet.int96RebaseModeInWrite`
- `spark.sql.parquet.datetimeRebaseModeInWrite` instead of
`spark.sql.legacy.parquet.datetimeRebaseModeInWrite`
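The strict-versus-replace distinction described in the new bullet can be illustrated with plain Python's codec error handlers. This is a sketch of the analogous behavior, not Spark code itself: strict decoding corresponds to the Spark 4.0 error, and replacement decoding corresponds to the pre-4.0 mojibake output restored by `spark.sql.legacy.codingErrorAction`.

```python
# The bytes [116, -23, 115, 116] are "tést" encoded in Latin-1;
# -23 is the signed-byte form of 0xE9.
data = bytes([116, 0xE9, 115, 116])

# Strict decoding fails on the unmappable byte, analogous to
# Spark 4.0 raising MALFORMED_CHARACTER_CODING:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# Replacement decoding substitutes U+FFFD for the invalid byte,
# analogous to the legacy behavior:
print(data.decode("utf-8", errors="replace"))  # prints "t�st"
```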
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]