[jira] [Commented] (FLINK-19016) Checksum mismatch when restore from RocksDB

Stephan Ewen (Jira) Mon, 21 Sep 2020 11:33:42 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-19016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199577#comment-17199577
 ]


Stephan Ewen commented on FLINK-19016:
--------------------------------------

[~stevenz3wu] If this is in fact possible, that error happen in that case 
without the writer noticing them, then we may have an issue.

Is it possible to have cases where write succeeds, data is on OS cache, but the 
disk is full so some part of the file gets dropped later? 

> Checksum mismatch when restore from RocksDB
> -------------------------------------------
>
>                 Key: FLINK-19016
>                 URL: https://issues.apache.org/jira/browse/FLINK-19016
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.1
>            Reporter: Jiayi Liao
>            Priority: Major
>
> The error stack is shown below:
> {code:java}
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed 
> state backend for 
> KeyedMapBundleOperator_44cfc1ca74b40bb44eed1f38f72b3ea9_(71/300) from any of 
> the 1 provided restore options.
>         at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>         at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
>         at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>         ... 6 more
> Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught 
> unexpected exception.
>         at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:333)
>         at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:580)
>         at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
>         at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
>         at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
>         ... 8 more
> Caused by: java.io.IOException: Error while opening RocksDB instance.
>         at 
> org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:74)
>         at 
> org.apache.flink.contrib.streaming.state.restore.AbstractRocksDBRestoreOperation.openDB(AbstractRocksDBRestoreOperation.java:131)
>         at 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:214)
>         at 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
>         at 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
>         at 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
>         at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:277)
>         ... 12 more
> Caused by: org.rocksdb.RocksDBException: checksum mismatch
>         at org.rocksdb.RocksDB.open(Native Method)
>         at org.rocksdb.RocksDB.open(RocksDB.java:286)
>         at 
> org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:66)
>         ... 18 more
> {code}
> The machine goes down because of hardware problem, then the job cannot 
> restart successfully anymore. After digging a little bit, I found that 
> RocksDB in Flink uses sync instead of fsync to synchronized the data with the 
> disk. With sync operation, the RocksDB cannot guarantee that the current 
> in-progress file can be persisted on disk in takeDBNativeCheckpoint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-19016) Checksum mismatch when restore from RocksDB

Reply via email to