[
https://issues.apache.org/jira/browse/FLINK-39759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leonard Xu updated FLINK-39759:
-------------------------------
Fix Version/s: cdc-3.6.0
> Fix StarRocks CHAR / VARCHAR mapping rules to contain utf8mb4 characters
> ------------------------------------------------------------------------
>
> Key: FLINK-39759
> URL: https://issues.apache.org/jira/browse/FLINK-39759
> Project: Flink
> Issue Type: Bug
> Components: Flink CDC
> Affects Versions: cdc-3.5.0
> Reporter: Mao Jiayi
> Assignee: Mao Jiayi
> Priority: Major
> Labels: pull-request-available
> Fix For: cdc-3.6.0
>
>
> When using the StarRocks YAML connector to sync tables whose upstream
> {{CHAR}} or {{VARCHAR}} columns may contain utf8mb4 characters, the job may
> create undersized StarRocks columns and fail when writing data.
> The issue happens in {{StarRocksUtils.toStarRocksDataType()}}. The
> current mapping logic multiplies the upstream character length by {{3}} when
> converting {{CHAR}} and {{VARCHAR}} types to StarRocks column definitions.
> This assumes that each character takes at most 3 bytes in UTF-8. However,
> utf8mb4 characters can require up to 4 bytes, so the inferred StarRocks
> column length may be smaller than required.
> This issue is not exposed for sources whose character set uses at most 3
> bytes per character, because the current mapping remains sufficient in those
> cases. It manifests only when the upstream source uses a utf8mb4-like encoding
> and the data contains 4-byte Unicode characters.
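
The arithmetic behind the report can be checked with a minimal, self-contained sketch. It is not the Flink CDC code: the class and the helpers `reservedBytes`, `utf8Bytes`, `fits`, and `repeat` are hypothetical names introduced only for illustration, and it assumes the target column length is counted in bytes, as the description implies. It compares the byte budget produced by a per-character multiplier with the actual UTF-8 size of a value made of 4-byte characters.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: not the StarRocksUtils implementation.
class Utf8mb4LengthSketch {

    // Hypothetical helper: byte budget when the upstream character
    // length is multiplied by a fixed bytes-per-character factor.
    static int reservedBytes(int charLength, int bytesPerChar) {
        return charLength * bytesPerChar;
    }

    // Actual size of the value once encoded as UTF-8.
    static int utf8Bytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the value fits in a column sized from its own
    // character (code point) count and the given multiplier.
    static boolean fits(String value, int bytesPerChar) {
        int chars = value.codePointCount(0, value.length());
        return utf8Bytes(value) <= reservedBytes(chars, bytesPerChar);
    }

    // Repeat a string n times without relying on String.repeat (Java 11+).
    static String repeat(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is outside the Basic Multilingual Plane and needs 4 bytes in UTF-8.
        String tenEmoji = repeat("\uD83D\uDE00", 10);

        System.out.println("actual bytes:  " + utf8Bytes(tenEmoji));
        System.out.println("budget with 3: " + reservedBytes(10, 3));
        System.out.println("budget with 4: " + reservedBytes(10, 4));
        System.out.println("fits with 3:   " + fits(tenEmoji, 3));
        System.out.println("fits with 4:   " + fits(tenEmoji, 4));
    }
}
```

For an upstream VARCHAR(10) holding ten such characters, the value needs 40 bytes while a multiplier of 3 budgets only 30, which matches the write failure described above. A multiplier of 4 covers the worst case for UTF-8; how the actual fix handles the resulting larger lengths is not stated in this message.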