How to unsubscribe?

On Fri, May 19, 2023 at 8:13 AM Feifan Wang (Jira) <j...@apache.org> wrote:
> Feifan Wang created FLINK-32130:
> -----------------------------------
>
>              Summary: previous checkpoint will be broke by the subsequent incremental checkpoint
>                  Key: FLINK-32130
>                  URL: https://issues.apache.org/jira/browse/FLINK-32130
>              Project: Flink
>           Issue Type: Bug
>             Reporter: Feifan Wang
>
>
> Currently, _SharedStateRegistryImpl_ discards the old state when a new state is registered under the same key:
> {code:java}
> // Old entry is not in a confirmed checkpoint yet, and the new one differs.
> // This might result from (omitted KG range here for simplicity):
> // 1. Flink recovers from a failure using a checkpoint 1
> // 2. State Backend is initialized to UID xyz and a set of SST: { 01.sst }
> // 3. JM triggers checkpoint 2
> // 4. TM sends handle: "xyz-002.sst"; JM registers it under "xyz-002.sst"
> // 5. TM crashes; everything is repeated from (2)
> // 6. TM recovers from CP 1 again: backend UID "xyz", SST { 01.sst }
> // 7. JM triggers checkpoint 3
> // 8. TM sends NEW state "xyz-002.sst"
> // 9. JM discards it as duplicate
> // 10. checkpoint completes, but a wrong SST file is used
> // So we use a new entry and discard the old one:
> LOG.info(
>         "Duplicated registration under key {} of a new state: {}. "
>                 + "This might happen during the task failover if state backend creates different states with the same key before and after the failure. "
>                 + "Discarding the OLD state and keeping the NEW one which is included into a completed checkpoint",
>         registrationKey,
>         newHandle);
> scheduledStateDeletion = entry.stateHandle;
> entry.stateHandle = newHandle;
> {code}
> But if _execution.checkpointing.max-concurrent-checkpoints_ > 1, the following case fails (taking _RocksDBStateBackend_ as an example):
> # cp1 is triggered: 1.sst is uploaded to file-1 and <1.sst,file-1> is registered; cp1 references file-1.
> # cp1 is not yet complete when cp2 is triggered: 1.sst is uploaded to file-2, and cp2 tries to register <1.sst,file-2>. SharedStateRegistry discards file-1.
> # cp1 completes and cp2 fails, but cp1 is broken (file-1 has been deleted).
>
> I think we should allow registering multiple state objects under the same key. WDYT [~pnowojski], [~roman]?
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
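
For readers following the quoted issue, the three-step scenario can be reproduced with a toy model. This is only a sketch under stated assumptions: ReplacingRegistry and MultiHandleRegistry are hypothetical names invented here, files are plain strings, and nothing below is Flink's real SharedStateRegistry API or the fix that was eventually adopted. The first class mimics the "discard the OLD state" behaviour from the quoted snippet; the second illustrates the suggestion of allowing several state objects under one key, tracked per checkpoint.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SharedStateRegistrySketch {

    /** Hypothetical toy registry: a new handle under an existing key replaces and deletes the old one. */
    static class ReplacingRegistry {
        private final Map<String, String> fileByKey = new HashMap<>();
        final Set<String> deletedFiles = new HashSet<>();

        void register(String key, String file) {
            String old = fileByKey.put(key, file);
            if (old != null && !old.equals(file)) {
                deletedFiles.add(old); // old handle is discarded right away
            }
        }
    }

    /** Hypothetical alternative: keep every file registered under a key, with the checkpoints that use it. */
    static class MultiHandleRegistry {
        // key -> (file -> ids of checkpoints referencing that file)
        private final Map<String, Map<String, Set<Long>>> refs = new HashMap<>();
        final Set<String> deletedFiles = new HashSet<>();

        void register(String key, String file, long checkpointId) {
            refs.computeIfAbsent(key, k -> new HashMap<>())
                    .computeIfAbsent(file, f -> new HashSet<>())
                    .add(checkpointId);
        }

        /** Drops a failed or subsumed checkpoint and deletes only files that nothing references any more. */
        void unregister(long checkpointId) {
            for (Map<String, Set<Long>> files : refs.values()) {
                files.entrySet().removeIf(e -> {
                    e.getValue().remove(checkpointId);
                    if (e.getValue().isEmpty()) {
                        deletedFiles.add(e.getKey());
                        return true;
                    }
                    return false;
                });
            }
        }
    }

    public static void main(String[] args) {
        // Steps 1-2 of the issue: cp1 and a concurrent cp2 both upload 1.sst.
        ReplacingRegistry replacing = new ReplacingRegistry();
        replacing.register("1.sst", "file-1"); // cp1
        replacing.register("1.sst", "file-2"); // cp2, while cp1 is still pending
        System.out.println("replacing: file-1 deleted = " + replacing.deletedFiles.contains("file-1"));

        // Same sequence with multiple handles per key; step 3: cp2 fails, cp1 completes.
        MultiHandleRegistry multi = new MultiHandleRegistry();
        multi.register("1.sst", "file-1", 1L);
        multi.register("1.sst", "file-2", 2L);
        multi.unregister(2L);
        System.out.println("multi: file-1 deleted = " + multi.deletedFiles.contains("file-1"));
        System.out.println("multi: file-2 deleted = " + multi.deletedFiles.contains("file-2"));
    }
}
```

In the replacing variant file-1 is gone before cp1 completes, which is the breakage the issue describes. In the multi-handle variant only the failed cp2's file-2 is cleaned up, so cp1 stays restorable. The trade-off in this sketch is that the registry has to track which checkpoint owns which file instead of keeping one handle per key.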