[ 
https://issues.apache.org/jira/browse/FLINK-32953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinzhong Li updated FLINK-32953:
--------------------------------
    Description: 
Because expired data is cleaned up in the background on a best-effort basis 
(the hashmap state backend uses the INCREMENTAL_CLEANUP strategy, the RocksDB 
state backend uses the ROCKSDB_COMPACTION_FILTER strategy), some expired state 
is often persisted into snapshots.
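
For context, this is roughly how a job enables these cleanup strategies through 
Flink's StateTtlConfig. A minimal sketch: the 12-hour TTL, the cleanup 
parameters and the state name "user-state" are illustrative assumptions, not 
values from any particular job.

{code:java}
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class TtlConfigExample {
    public static void main(String[] args) {
        // Illustrative values only: 12-hour TTL, cleanup parameters, state name.
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.hours(12))
                // Hashmap backend: INCREMENTAL_CLEANUP, lazily cleans a few entries per state access.
                .cleanupIncrementally(10, false)
                // RocksDB backend: ROCKSDB_COMPACTION_FILTER, drops expired entries during compaction.
                .cleanupInRocksdbCompactFilter(1000)
                .build();

        ValueStateDescriptor<String> descriptor =
                new ValueStateDescriptor<>("user-state", String.class);
        descriptor.enableTimeToLive(ttlConfig);
    }
}
{code}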

 

In some scenarios, a user changes the state TTL of a job and then restores the 
job from the old state. If the user increases the state TTL from a short value 
to a long value (e.g. from 12 hours to 24 hours), some expired data that was 
not cleaned up comes back to life after the restore. Obviously this is 
unreasonable, and it may break data regulatory requirements.
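
Concretely, the TTL change in this scenario amounts to rebuilding the same TTL 
config with a longer duration before the job is restarted from the old 
snapshot. Again just a sketch, reusing the illustrative descriptor from the 
snippet above:

{code:java}
// Same state, longer TTL (12h -> 24h). Entries written under the old 12-hour TTL
// that the background cleanup never reached are treated as live again after restore.
StateTtlConfig newTtlConfig = StateTtlConfig
        .newBuilder(Time.hours(24))   // was Time.hours(12)
        .cleanupIncrementally(10, false)
        .cleanupInRocksdbCompactFilter(1000)
        .build();
descriptor.enableTimeToLive(newTtlConfig);
{code}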

 

In particular, the RocksDB state backend may cause data correctness problems in 
this case because of level compaction. For example, one key has two versions, 
at level-1 and level-2, both of which are TTL-expired. The level-1 version is 
then cleaned up by compaction while the level-2 version is not. If we increase 
the state TTL and restart the job, the stale level-2 data becomes valid again 
after the restore.

 

To solve this problem, I think we can:
1) persist the old state TTL into the snapshot meta info (e.g. in 
RegisteredKeyValueStateBackendMetaInfo or elsewhere);
2) during state restore, compare the current TTL with the old TTL;
3) if the current TTL is longer than the old TTL, iterate over all restored 
data, filter out entries that are already expired under the old TTL, and write 
only the still-valid entries into the state backend (see the sketch below).
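
A minimal sketch of the restore-time filtering in step 3, under the assumption 
that the old TTL has already been recovered from the snapshot meta info. The 
class and method names (TtlRestoreFilter, TtlEntry, filterExpiredOnRestore) are 
hypothetical and do not correspond to existing Flink internals:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; not existing Flink code.
public final class TtlRestoreFilter {

    /** Minimal value wrapper carrying the last-access timestamp, as TTL state does. */
    public static final class TtlEntry<V> {
        final V userValue;
        final long lastAccessTimestamp;

        TtlEntry(V userValue, long lastAccessTimestamp) {
            this.userValue = userValue;
            this.lastAccessTimestamp = lastAccessTimestamp;
        }
    }

    /**
     * Drops entries that were already expired under the OLD TTL, so that they
     * cannot come back to life when the job is restored with a longer TTL.
     */
    public static <K, V> Map<K, TtlEntry<V>> filterExpiredOnRestore(
            Map<K, TtlEntry<V>> restoredState, long oldTtlMillis, long restoreTimeMillis) {
        Map<K, TtlEntry<V>> stillValid = new HashMap<>();
        for (Map.Entry<K, TtlEntry<V>> e : restoredState.entrySet()) {
            long expirationTime = e.getValue().lastAccessTimestamp + oldTtlMillis;
            // Keep only entries that had not yet expired under the old TTL.
            if (expirationTime > restoreTimeMillis) {
                stillValid.put(e.getKey(), e.getValue());
            }
        }
        return stillValid;
    }
}
{code}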



> [State TTL]resolve data correctness problem after ttl was changed 
> ------------------------------------------------------------------
>
>                 Key: FLINK-32953
>                 URL: https://issues.apache.org/jira/browse/FLINK-32953
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>            Reporter: Jinzhong Li
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
