[ 
https://issues.apache.org/jira/browse/FLINK-39759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mao Jiayi updated FLINK-39759:
------------------------------
    Description: 
When using the StarRocks YAML connector to sync tables whose upstream {{CHAR}} 
or {{VARCHAR}} columns may contain utf8mb4 characters, the job may create 
undersized StarRocks columns and fail when writing data.

The issue happens in {*}{{StarRocksUtils.toStarRocksDataType()}}{*}. The 
current mapping logic multiplies the upstream character length by {{3}} when 
converting {{CHAR}} and {{VARCHAR}} types to StarRocks column definitions. This 
assumes that each character takes at most 3 bytes, which holds only for 
characters in the Basic Multilingual Plane. UTF-8 as covered by utf8mb4 can 
require up to 4 bytes per character, so the inferred StarRocks column length 
may be smaller than required.
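
The shortfall can be illustrated with a minimal Java sketch. The class and 
helper names below are hypothetical and are not taken from 
{{StarRocksUtils}}; the sketch only assumes that the target column length is 
counted in bytes and compares a 3x and a 4x multiplier against the actual 
UTF-8 size of a value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the connector's actual code.
class VarcharWidthSketch {

    // Byte length inferred for an upstream CHAR/VARCHAR(charLength) column,
    // given an assumed maximum number of bytes per character.
    static int inferredByteLength(int charLength, int bytesPerChar) {
        return charLength * bytesPerChar;
    }

    // Number of bytes needed to store the value in UTF-8.
    static int utf8ByteLength(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in Unicode code points rather than UTF-16 code units.
    static int codePointLength(String value) {
        return value.codePointCount(0, value.length());
    }

    public static void main(String[] args) {
        // Two U+1F600 characters; each one needs 4 bytes in UTF-8.
        String value = "\uD83D\uDE00\uD83D\uDE00";
        int chars = codePointLength(value);
        int needed = utf8ByteLength(value);

        System.out.println("characters:      " + chars);
        System.out.println("bytes needed:    " + needed);
        System.out.println("3x column width: " + inferredByteLength(chars, 3));
        System.out.println("4x column width: " + inferredByteLength(chars, 4));
    }
}
```

In this sketch a value that fits an upstream {{VARCHAR(2)}} needs 8 bytes, 
which exceeds the 6 bytes reserved by the 3x rule but fits the 8 bytes 
reserved by a 4x rule.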

This issue does not surface for sources whose character set uses at most 3 
bytes per character, because the 3x mapping remains sufficient in those 
cases. It manifests only when the upstream source uses a utf8mb4-like encoding 
and the data actually contains 4-byte Unicode characters.

  was:
When using the MongoDB YAML connector to read data containing 
{{BsonDecimal128}} values, the job may fail if the decimal value has leading 
zeros after the decimal point (e.g., {{0.0001234}}).

The failure happens in {*}CdcTypeConverter.toCdcType(){*}. Java's 
{{BigDecimal}} treats leading zeros in the fractional part as insignificant 
digits, resulting in a {{precision}} that is smaller than the {{scale}}. 
For example, {{0.0001234}} yields {{precision = 4}} and {{scale = 7}}. 
Flink CDC's *DECIMAL(precision, scale)* type requires {{precision >= scale}}, 
so the type inference throws an exception.

This issue is not exposed for typical {{Decimal128}} values like {{10.99}} 
(where {{precision = 4}}, {{scale = 2}}) because the constraint 
naturally holds. It only manifests when the fractional part contains more 
digits than the significant digits, specifically when leading zeros push the 
{{scale}} beyond the {{precision}}.
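
The {{BigDecimal}} behaviour described above can be reproduced with a short 
Java sketch. The class and the {{adjustedPrecision}} helper are hypothetical; 
widening the precision to at least the scale is shown only as one possible 
way to satisfy {{precision >= scale}}, not as the fix actually applied in 
{{CdcTypeConverter}}.

```java
import java.math.BigDecimal;

// Hypothetical illustration only; not the connector's actual code.
class DecimalPrecisionSketch {

    // One possible adjustment (an assumption): make the declared precision
    // at least as large as the scale so DECIMAL(precision, scale) is valid.
    static int adjustedPrecision(BigDecimal value) {
        return Math.max(value.precision(), value.scale());
    }

    public static void main(String[] args) {
        BigDecimal small = new BigDecimal("0.0001234");
        BigDecimal typical = new BigDecimal("10.99");

        // Leading fractional zeros are not counted as significant digits,
        // so the precision ends up smaller than the scale.
        System.out.println("0.0001234: precision=" + small.precision()
                + ", scale=" + small.scale());
        System.out.println("10.99:     precision=" + typical.precision()
                + ", scale=" + typical.scale());
        System.out.println("adjusted precision for 0.0001234: "
                + adjustedPrecision(small));
    }
}
```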

        Summary: Fix StarRocks CHAR / VARCHAR mapping rules to contain utf8mb4 
characters  (was: Fix MongoDB YAML connector fails to infer type from some 
BsonDecimal128 data)

> Fix StarRocks CHAR / VARCHAR mapping rules to contain utf8mb4 characters
> ------------------------------------------------------------------------
>
>                 Key: FLINK-39759
>                 URL: https://issues.apache.org/jira/browse/FLINK-39759
>             Project: Flink
>          Issue Type: Bug
>          Components: Flink CDC
>            Reporter: Mao Jiayi
>            Priority: Major
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
