gustavodemorais opened a new pull request, #28110:
URL: https://github.com/apache/flink/pull/28110
## What is the purpose of the change
Adds the `@PublicEvolving` `StringData.fromUtf8Bytes` connector API and an
internal Flink-style UTF-8 validator. Existing `StringData.fromBytes` wraps the
byte array without validation, so invalid UTF-8 propagates and is later
silently substituted with `U+FFFD`. The new factory validates at ingestion and
throws on malformed input. This is the foundation for FLIP-568; the SQL
functions and the strict `CAST(BYTES AS STRING)` ship in follow-up tickets.
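To illustrate why silent substitution is a problem, here is a minimal sketch using only the JDK's lenient `new String(bytes, UTF_8)` decoding. It is not Flink's decoding path, and the class and method names (`LenientDecodeDemo`, `decodeLenient`, `roundTrips`) are hypothetical. It shows that a malformed byte becomes `U+FFFD` and that the original bytes cannot be recovered afterwards.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LenientDecodeDemo {

    /** Lenient JDK decode: malformed input is replaced with U+FFFD, no error is raised. */
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** True if decoding and re-encoding yields the original bytes. */
    public static boolean roundTrips(byte[] bytes) {
        byte[] reencoded = decodeLenient(bytes).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reencoded);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {'a', (byte) 0xFF, 'b'};
        String decoded = decodeLenient(bad);

        System.out.println(decoded.indexOf('\uFFFD')); // position of the replacement char
        System.out.println(roundTrips(bad));           // the original byte is lost
    }
}
```

Once the replacement has happened, downstream operators have no way to tell that the input was corrupt, which is the motivation for rejecting such input at ingestion.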
## Brief change log
- Add `StringData.fromUtf8Bytes(byte[])` and `fromUtf8Bytes(byte[], int, int)`
`@PublicEvolving` factories. They throw `IllegalArgumentException` (with the
byte index) on invalid UTF-8 and return `null` for `null` input, matching
Spark's `UTF8String.fromBytes` and Flink's `BinaryStringData.fromString`.
- Mirror factories on `BinaryStringData` for same-package internal callers.
- Promote `StringUtf8Utils` to `public` and add
`firstInvalidUtf8ByteIndex(byte[], int, int)` next to the existing
`decodeUTF8Strict`. Same byte-level checks, no char-buffer side effect.
- Pull bit-pattern checks into named private helpers (`isAsciiByte`,
`is{2,3,4}ByteLead`, `isContinuation`, `isOverlong3`,
`decode{3,4}ByteSequence`) so the validator reads as prose. The helpers are
small enough that the JIT is expected to inline them.
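
The byte-level checks listed above can be sketched roughly as follows. This is a self-contained illustration and not the code in this PR: the class name `Utf8ValidatorSketch` is hypothetical, the helper structure is simplified, and reporting the index of the offending sequence's lead byte (relative to `offset`) is an assumption about the contract.

```java
public final class Utf8ValidatorSketch {

    private static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80; // 10xxxxxx
    }

    /**
     * Returns the index, relative to {@code offset}, of the lead byte of the first
     * invalid sequence, or -1 if the range is well-formed UTF-8. (Assumed contract.)
     */
    public static int firstInvalidUtf8ByteIndex(byte[] bytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset > bytes.length - length) {
            throw new IndexOutOfBoundsException(
                    "offset=" + offset + ", length=" + length + ", array=" + bytes.length);
        }
        int end = offset + length;
        int i = offset;
        while (i < end) {
            int b0 = bytes[i] & 0xFF;
            if (b0 < 0x80) { // ASCII: single byte
                i++;
                continue;
            }
            int need; // number of continuation bytes
            int min;  // smallest code point allowed at this width (rejects overlongs)
            int cp;   // code point accumulated so far
            if (b0 >= 0xC2 && b0 <= 0xDF) {        // 2-byte lead (0xC0/0xC1 are forbidden)
                need = 1; min = 0x80; cp = b0 & 0x1F;
            } else if ((b0 & 0xF0) == 0xE0) {      // 3-byte lead
                need = 2; min = 0x800; cp = b0 & 0x0F;
            } else if (b0 >= 0xF0 && b0 <= 0xF4) { // 4-byte lead (0xF5..0xFF are forbidden)
                need = 3; min = 0x10000; cp = b0 & 0x07;
            } else {                               // stray continuation or forbidden lead
                return i - offset;
            }
            if (i + need >= end) {                 // sequence truncated by end of range
                return i - offset;
            }
            for (int k = 1; k <= need; k++) {
                byte c = bytes[i + k];
                if (!isContinuation(c)) {
                    return i - offset;
                }
                cp = (cp << 6) | (c & 0x3F);
            }
            boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (cp < min || cp > 0x10FFFF || surrogate) {
                return i - offset;
            }
            i += need + 1;
        }
        return -1;
    }
}
```

Because the sketch only inspects bytes and never writes decoded characters anywhere, it has no char-buffer side effect, which is the property the change log attributes to the new method.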
## Verifying this change
- `StringUtf8UtilsTest` covers code-point boundaries at every width,
above-U+10FFFF, forbidden lead bytes, all overlong forms, full surrogate range,
ASCII fast-path mid-stream exit, offset/length variant, and bounds-check
failures.
- `BinaryStringDataFromUtf8BytesTest` covers the factory contract:
`testFromUtf8Bytes` (happy path + null tolerance) and
`testFromUtf8BytesRejectsInvalid` (each malformed class + relative-byte-index
error message).
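
The factory contract exercised by these tests (null tolerance, `IllegalArgumentException` carrying a byte index) can be approximated with the JDK's strict `CharsetDecoder`. The sketch below is an illustration only: `StrictUtf8Factory` is a hypothetical stand-in that returns a `String` rather than a `StringData`, and the message wording is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class StrictUtf8Factory {

    /** Returns null for null input; throws IllegalArgumentException on malformed UTF-8. */
    public static String fromUtf8Bytes(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never decodes to more UTF-16 chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(bytes.length);

        CoderResult result = decoder.decode(in, out, true);
        if (result.isUnderflow()) {
            result = decoder.flush(out);
        }
        if (result.isError()) {
            // On a malformed-input result the input position marks the start of the bad bytes.
            throw new IllegalArgumentException("Invalid UTF-8 at byte index " + in.position());
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fromUtf8Bytes(new byte[] {0x68, 0x69}));
        try {
            fromUtf8Bytes(new byte[] {0x61, (byte) 0xFF});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Unlike this sketch, a validate-then-wrap factory does not need to materialize decoded characters at all; the decoder is used here only because it is a well-known strict reference.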
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (yes - new `StringData.fromUtf8Bytes` static factories on
the `@PublicEvolving` interface)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no - opt-in
factory; existing `fromBytes` unchanged)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented? (JavaDocs on
`StringData.fromUtf8Bytes` covering when to prefer it over `fromBytes` and the
trade-off between its O(n) validation and the O(1) unvalidated wrap done by
`fromBytes`. SQL docs ship with FLINK-39602.)
---
##### Was generative AI tooling used to co-author this PR?
- [x] Yes (please specify the tool below)
2.1.117 (Claude Code)