[jira] [Updated] (FLINK-36798) Improve data processing speed during the phase from snapshot to incremental phase

Yanquan Lv (Jira) Mon, 25 Nov 2024 23:53:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-36798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yanquan Lv updated FLINK-36798:
-------------------------------
    Description: 
During the transition from the snapshot phase to the incremental phase, each 
input record must be compared against all finished splits to find the binlog 
offset that decides whether the record should be emitted. This lookup is 
`O（n）` per record, which makes it very time-consuming.

We can speed up data processing in the following ways:
1. For numeric fields, we can directly calculate which chunk a record belongs 
to from its primary key and the chunk size information. The complexity is `O(1)`.
2. For non-numeric fields, we can use binary search to find the chunk to which 
the record belongs. The complexity is `O（log n）`.
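
The two lookups above can be sketched roughly as follows. This is only an 
illustration, not the Flink CDC implementation: the class `ChunkLookup`, both 
method names, and all parameters are hypothetical. It assumes that numeric 
chunks are evenly sized and start at `minKey`, and that non-numeric chunks are 
described by a sorted list of exclusive end keys, with the first and last 
chunks unbounded.

```java
import java.util.List;

/** Hypothetical helper, not part of Flink CDC: maps a record key to a chunk index. */
final class ChunkLookup {

    private ChunkLookup() {}

    /**
     * O(1) lookup for numeric keys.
     * Assumes chunk i covers [minKey + i * chunkSize, minKey + (i + 1) * chunkSize);
     * keys outside that range are clamped to the first or last chunk.
     * Overflow of (key - minKey) is not handled in this sketch.
     */
    static int numericChunkIndex(long key, long minKey, long chunkSize, int chunkCount) {
        if (chunkSize <= 0 || chunkCount <= 0) {
            throw new IllegalArgumentException("chunkSize and chunkCount must be positive");
        }
        long index = Math.floorDiv(key - minKey, chunkSize);
        return (int) Math.max(0L, Math.min(index, (long) chunkCount - 1));
    }

    /**
     * O(log n) lookup for any comparable key.
     * Assumes sortedChunkEnds holds the exclusive end key of every chunk except
     * the last one, in ascending order, so there are size() + 1 chunks in total.
     * Returns the number of end keys that are <= key, which is the chunk index.
     * The list should support constant-time get(), e.g. an ArrayList.
     */
    static <T extends Comparable<? super T>> int searchChunkIndex(T key, List<T> sortedChunkEnds) {
        int lo = 0;
        int hi = sortedChunkEnds.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (key.compareTo(sortedChunkEnds.get(mid)) < 0) {
                hi = mid;      // key lies before this end key: chunk is mid or earlier
            } else {
                lo = mid + 1;  // key is at or past this end key: chunk is after mid
            }
        }
        return lo;
    }
}
```

With the returned index, a caller would presumably only need to compare the 
record's binlog offset against the offset stored for that single chunk, 
instead of scanning every finished split.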

  was:
During the transition from the snapshot phase to the incremental phase, each 
input record must be compared against all finished splits to find the binlog 
offset that decides whether the record should be emitted. This lookup is 
`O(n)` per record, which makes it very time-consuming.

We can speed up data processing in the following ways:
1. For numeric fields, we can directly calculate which chunk a record belongs 
to from its primary key and the chunk size information. The complexity is `O(1)`.
2. For non-numeric fields, we can use binary search to find the chunk to which 
the record belongs. The complexity is `O(log n)`.


> Improve data processing speed during the phase from snapshot to incremental 
> phase
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-36798
>                 URL: https://issues.apache.org/jira/browse/FLINK-36798
>             Project: Flink
>          Issue Type: Improvement
>          Components: Flink CDC
>    Affects Versions: cdc-3.1.0, cdc-3.2.0, cdc-3.1.1
>            Reporter: Yanquan Lv
>            Priority: Major
>             Fix For: cdc-3.3.0
>
>
> During the transition from the snapshot phase to the incremental phase, each 
> input record must be compared against all finished splits to find the binlog 
> offset that decides whether the record should be emitted. This lookup is 
> `O（n）` per record, which makes it very time-consuming.
> We can speed up data processing in the following ways:
> 1. For numeric fields, we can directly calculate which chunk a record belongs 
> to from its primary key and the chunk size information. The complexity is `O(1)`.
> 2. For non-numeric fields, we can use binary search to find the chunk to 
> which the record belongs. The complexity is `O（log n）`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
