[
https://issues.apache.org/jira/browse/FLINK-31192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-31192:
-----------------------------------
Labels: pull-request-available (was: )
> dataGen takes too long to initialize under sequence
> ---------------------------------------------------
>
> Key: FLINK-31192
> URL: https://issues.apache.org/jira/browse/FLINK-31192
> Project: Flink
> Issue Type: Improvement
> Affects Versions: 1.17.0, 1.15.3, 1.16.1
> Reporter: xzw0223
> Assignee: xzw0223
> Priority: Major
> Labels: pull-request-available
>
> The SequenceGenerator preloads all sequence values in open(). If the total
> number of elements is very large, initialization takes too long.
> [https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/source/datagen/SequenceGenerator.java#L91]
> The reason is that the Deque doubles its capacity whenever the current
> capacity is full, and each expansion requires an array copy, which is
> time-consuming.
>
> Here is what I propose:
> Do not preload the full sequence. Instead, generate one value each time
> next() is called, which removes the slow initialization caused by loading
> the full amount of data.
> Record the current position in the sequence in the checkpoint, and after an
> abnormal restart continue sending data from the recorded position to ensure
> fault tolerance.
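
The two points in the proposal above can be illustrated with a minimal,
standalone sketch. This is not the actual Flink SequenceGenerator and does not
use Flink's state API: the class LazySequence and the methods
snapshotPosition() and restorePosition() are hypothetical names chosen for
illustration, and the sketch ignores how a sequence would be split across
parallel subtasks. It assumes an inclusive [start, end] range with
end < Long.MAX_VALUE.

```java
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: a sequence that computes each value on demand
 * instead of preloading every value into a Deque.
 */
final class LazySequence {
    private final long start;  // first value of the sequence (inclusive)
    private final long end;    // last value of the sequence (inclusive)
    private long nextValue;    // the only state that must be checkpointed

    LazySequence(long start, long end) {
        if (start > end || end == Long.MAX_VALUE) {
            throw new IllegalArgumentException("require start <= end < Long.MAX_VALUE");
        }
        this.start = start;
        this.end = end;
        this.nextValue = start; // constant-time setup, no preloading
    }

    boolean hasNext() {
        return nextValue <= end;
    }

    long next() {
        if (!hasNext()) {
            throw new NoSuchElementException("sequence exhausted");
        }
        return nextValue++;
    }

    /** Position to store in a checkpoint: the next value to be emitted. */
    long snapshotPosition() {
        return nextValue;
    }

    /** Resume from a previously recorded position after a restart. */
    void restorePosition(long position) {
        if (position < start || position > end + 1) {
            throw new IllegalArgumentException("position outside sequence range");
        }
        this.nextValue = position;
    }

    public static void main(String[] args) {
        LazySequence seq = new LazySequence(1L, 5L);
        seq.next();
        seq.next();
        long checkpointed = seq.snapshotPosition();

        // Simulate a restart: build a fresh instance and restore the position.
        LazySequence restored = new LazySequence(1L, 5L);
        restored.restorePosition(checkpointed);
        while (restored.hasNext()) {
            System.out.println(restored.next());
        }
    }
}
```

With this shape, setup cost and memory use do not depend on the size of the
range, and the checkpointed state is a single long rather than the remaining
contents of a Deque.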