lriggs opened a new pull request, #50137: URL: https://github.com/apache/arrow/pull/50137
### Rationale for this change

`CHR(n)` only worked for ASCII (0–127). Values ≥ 128 emitted a single raw byte (invalid UTF‑8), causing "Error during planning".

Goal: emit the proper multi‑byte **UTF‑8 encoding** of the Unicode code point, consistent with PostgreSQL/Snowflake.

### What changes are included in this PR?

#### Arrow (C++ / Gandiva)

| File | Change |
|------|--------|
| `cpp/src/gandiva/precompiled/string_ops.cc` | `chr_int64` rewritten to UTF‑8‑encode the code point (1–4 bytes) and error on invalid input (negative, > 0x10FFFF, surrogate range 0xD800–0xDFFF). `chr_int32` now delegates to it. |
| `cpp/src/gandiva/precompiled/string_ops_test.cc` | `TestChrBigInt` rewritten for UTF‑8 semantics: every byte‑length boundary (1/2/3/4‑byte, low+high), í/€/日/😀, and the three invalid‑input error cases. |

### Are these changes tested?

Yes, unit tests.

### Are there any user-facing changes?

Yes, the CHR Gandiva function now supports Unicode characters.
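
For readers unfamiliar with the encoding rules that the `chr_int64` rewrite describes, the following is a minimal standalone sketch of UTF‑8 encoding with the same three rejection cases (negative, > 0x10FFFF, surrogate range). It is not the Gandiva implementation: the name `EncodeUtf8` and the `std::optional` return type are assumptions made for illustration, and the real function reports errors through Gandiva's own mechanism. The sketch assumes C++17.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper (not part of Gandiva): returns the UTF-8 bytes for a
// Unicode code point, or std::nullopt when the input is not a valid scalar
// value (negative, above 0x10FFFF, or in the surrogate range 0xD800-0xDFFF).
std::optional<std::string> EncodeUtf8(int64_t cp) {
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    return std::nullopt;
  }
  std::string out;
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
  return out;
}
```

Under this sketch, `EncodeUtf8(0x20AC)` yields the three bytes `E2 82 AC` (€), while `EncodeUtf8(0xD800)` yields no value because surrogates are not encodable code points.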
