[ 
https://issues.apache.org/jira/browse/FLINK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanquan Lv updated FLINK-38644:
-------------------------------
    Affects Version/s: cdc-3.5.0
                       cdc-3.4.0
                       cdc-3.2.1
                       cdc-3.3.0
                       cdc-3.1.1
                       cdc-3.2.0
                       cdc-3.1.0

> Reading tables with String type as the primary key may cause OutOfMemory Error
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-38644
>                 URL: https://issues.apache.org/jira/browse/FLINK-38644
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1, cdc-3.3.0, cdc-3.2.1, 
> cdc-3.4.0, cdc-3.5.0
>            Reporter: Yanquan Lv
>            Priority: Major
>
> When using a {*}String type as the primary key{*}, {{MySqlChunkSplitter}} 
> employs an {*}uneven chunking algorithm{*}. Specifically, it queries the 
> {{min}} and {{max}} values of the key range, calculates the {{chunkEnd}} 
> based on {{chunkStart}} and {{{}chunkSize{}}}, and compares {{chunkEnd}} with 
> {{max}} to decide whether to continue splitting into a next chunk.
> However, the queries for {{{}min{}}}, {{{}max{}}}, and 
> {{{}chunkEnd{}}} are evaluated using {*}MySQL's collation rules{*}. In contrast, when 
> comparing {{chunkEnd}} and {{max}} to decide the chunk boundary, the 
> comparison relies on {*}Java's string ordering{*}. By default, *MySQL compares 
> strings case-insensitively*, while {*}Java's string comparison is 
> case-sensitive{*}. This discrepancy may produce {*}incorrect chunk boundaries{*}, 
> which can ultimately lead to an {*}{{OutOfMemoryError}}{*}.
> For example, in MySQL, consider a set of primary key data sorted by the 
> database's collation rules as:
> {{{}"a1,A2,b1,B2,c1,C2,d1,D2,e1,E2,f1,F2"{}}}.
> Assume the {{chunkSize}} is 4. The computed {{min/max}} values would be 
> {{a1}} and {{{}F2{}}}.
>  * {*}First Chunk{*}: The calculated {{chunkEnd}} is {{{}B2{}}}.
>  * {*}Second Chunk{*}: The calculated {{chunkEnd}} is {{{}d1{}}}.
> However, due to Java's lexicographical string comparison (case-sensitive), 
> {{d1}} is considered *greater than* {{F2}} (since {{'d' > 'F'}} in ASCII). As 
> a result:
>  * The second chunk's {{chunkEnd}} becomes {{{}null{}}}.
>  * The final chunks are: {{[null, B2]}} and {{{}[B2, null]{}}}.
> Because the second chunk has no upper bound, it covers all remaining rows 
> of the table. Reading such a chunk on the {*}TaskManager{*} may cause an 
> {*}{{OutOfMemoryError}}{*}.
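
The example above can be replayed in plain Java. This is a minimal sketch under stated assumptions, not the actual MySqlChunkSplitter code: the class ChunkSplitSketch and its methods are hypothetical, an in-memory list already sorted in database order stands in for the MySQL queries, and String.CASE_INSENSITIVE_ORDER only approximates a case-insensitive MySQL collation. It applies the same boundary check (chunkEnd compared with max) once with Java's natural string order and once with a case-insensitive order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChunkSplitSketch {

    // Hypothetical stand-in for the "next chunk max" query: the largest key among
    // the first chunkSize rows at or after chunkStart, in database order.
    static String queryNextChunkMax(List<String> keysInDbOrder, String chunkStart, int chunkSize) {
        int from = keysInDbOrder.indexOf(chunkStart);
        int to = Math.min(from + chunkSize, keysInDbOrder.size()) - 1;
        return keysInDbOrder.get(to);
    }

    // Returns the end boundary of every chunk; a null end means "no upper bound".
    // Assumes unique keys and chunkSize >= 2 so that each iteration makes progress.
    static List<String> splitChunkEnds(
            List<String> keysInDbOrder, int chunkSize, Comparator<String> boundaryComparator) {
        if (chunkSize < 2) {
            throw new IllegalArgumentException("chunkSize must be at least 2 in this sketch");
        }
        String max = keysInDbOrder.get(keysInDbOrder.size() - 1);
        List<String> chunkEnds = new ArrayList<>();
        String chunkStart = keysInDbOrder.get(0);
        while (true) {
            String chunkEnd = queryNextChunkMax(keysInDbOrder, chunkStart, chunkSize);
            // The boundary check described in the issue: stop once chunkEnd >= max.
            if (boundaryComparator.compare(chunkEnd, max) >= 0) {
                chunkEnds.add(null);
                return chunkEnds;
            }
            chunkEnds.add(chunkEnd);
            chunkStart = chunkEnd;
        }
    }

    public static void main(String[] args) {
        // Keys in the order given in the issue description.
        List<String> keys = Arrays.asList(
                "a1", "A2", "b1", "B2", "c1", "C2", "d1", "D2", "e1", "E2", "f1", "F2");

        // Case-sensitive Java ordering: splitting stops early at "d1".
        System.out.println(splitChunkEnds(keys, 4, Comparator.<String>naturalOrder()));
        // prints [B2, null]

        // Ordering consistent with the database: splitting continues as intended.
        System.out.println(splitChunkEnds(keys, 4, String.CASE_INSENSITIVE_ORDER));
        // prints [B2, d1, E2, null]
    }
}
```

With the case-sensitive comparison the last chunk starts at B2 and has no upper bound, so it spans eight of the twelve keys instead of at most four. On a large table the same effect would put nearly the whole table into one chunk.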



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
