[
https://issues.apache.org/jira/browse/FLINK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yanquan Lv updated FLINK-38644:
-------------------------------
Affects Version/s: cdc-3.5.0
cdc-3.4.0
cdc-3.2.1
cdc-3.3.0
cdc-3.1.1
cdc-3.2.0
cdc-3.1.0
> Reading tables with String type as the primary key may cause OutOfMemory Error
> ------------------------------------------------------------------------------
>
> Key: FLINK-38644
> URL: https://issues.apache.org/jira/browse/FLINK-38644
> Project: Flink
> Issue Type: Bug
> Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1, cdc-3.3.0, cdc-3.2.1,
> cdc-3.4.0, cdc-3.5.0
> Reporter: Yanquan Lv
> Priority: Major
>
> When a table uses a *String type as the primary key*, {{MySqlChunkSplitter}}
> employs an *uneven chunking algorithm*. Specifically, it queries the
> {{min}} and {{max}} values of the key range, calculates {{chunkEnd}}
> from {{chunkStart}} and {{chunkSize}}, and compares {{chunkEnd}} with
> {{max}} to decide whether to continue with the next chunk split.
> However, the queries for {{min}}, {{max}}, and {{chunkEnd}} are evaluated
> with *MySQL's sorting rules*, whereas the comparison of {{chunkEnd}} with
> {{max}} that decides the chunk boundary uses *Java's string sorting rules*.
> MySQL's default collations are *case-insensitive* in string comparisons,
> while *Java's string sorting is case-sensitive*. This discrepancy can make
> the splitter stop splitting too early, which can ultimately lead to an
> *{{OutOfMemoryError}}*.
> For example, in MySQL, consider a set of primary key data sorted by the
> database's collation rules as:
> {{"a1,A2,b1,B2,c1,C2,d1,D2,e1,E2,f1,F2"}}.
> Assume the {{chunkSize}} is 4. The computed {{min}}/{{max}} values would be
> {{a1}} and {{F2}}.
> * *First Chunk*: The calculated {{chunkEnd}} is {{B2}}.
> * *Second Chunk*: The calculated {{chunkEnd}} is {{d1}}.
> However, under Java's lexicographical (case-sensitive) string comparison,
> {{d1}} is considered *greater than* {{F2}} (since {{'d' > 'F'}} in ASCII:
> 100 versus 70). As a result:
> * The second chunk's {{chunkEnd}} becomes {{null}}.
> * The final chunks are: {{[null, B2]}} and {{[B2, null]}}.
> The second chunk is therefore open-ended and covers every remaining row
> instead of roughly {{chunkSize}} rows. When the *TaskManager* reads this
> oversized chunk, it may fail with an *{{OutOfMemoryError}}*.
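
The example above can be reproduced with a small, self-contained sketch. It is
a simplified model, not the actual {{MySqlChunkSplitter}} code: the class
{{ChunkBoundaryDemo}} and the method {{split}} are hypothetical names, the
list stands in for the table's primary keys in MySQL order, and
{{String.CASE_INSENSITIVE_ORDER}} is only a rough stand-in for a
case-insensitive MySQL collation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ChunkBoundaryDemo {
    // Primary keys from the issue, already sorted the way MySQL returns them
    // under a case-insensitive collation.
    static final List<String> KEYS = List.of(
            "a1", "A2", "b1", "B2", "c1", "C2", "d1", "D2", "e1", "E2", "f1", "F2");

    /**
     * Simplified model of the splitting loop (hypothetical, for illustration).
     * chunkEnd is the largest of the chunkSize keys starting at chunkStart,
     * taken in MySQL order; boundaryOrder is the comparator used for the
     * "chunkEnd >= max" check. Assumes a non-empty key list.
     */
    static List<String> split(List<String> keys, int chunkSize, Comparator<String> boundaryOrder) {
        if (chunkSize < 2) {
            throw new IllegalArgumentException("this sketch needs chunkSize >= 2");
        }
        List<String> chunks = new ArrayList<>();
        String max = keys.get(keys.size() - 1);
        String chunkStart = null; // the first chunk has no lower bound
        int startIdx = 0;
        while (true) {
            int endIdx = Math.min(startIdx + chunkSize - 1, keys.size() - 1);
            String chunkEnd = keys.get(endIdx);
            if (boundaryOrder.compare(chunkEnd, max) >= 0) {
                // chunkEnd is treated as having reached max: emit an open-ended last chunk.
                chunks.add("[" + chunkStart + ", null]");
                return chunks;
            }
            chunks.add("[" + chunkStart + ", " + chunkEnd + "]");
            chunkStart = chunkEnd;
            startIdx = endIdx;
        }
    }

    public static void main(String[] args) {
        // Boundary check with Java's case-sensitive String ordering.
        System.out.println(split(KEYS, 4, Comparator.naturalOrder()));
        // Boundary check with an ordering that matches the case-insensitive key order.
        System.out.println(split(KEYS, 4, String.CASE_INSENSITIVE_ORDER));
    }
}
```

With the case-sensitive comparator the model stops after {{[null, B2]}} and
{{[B2, null]}}, as described above. With the case-insensitive comparator it
continues and yields {{[null, B2]}}, {{[B2, d1]}}, {{[d1, E2]}} and
{{[E2, null]}}. This suggests the boundary check should use the same ordering
as the database, although a real fix would have to account for the column's
actual collation.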