[
https://issues.apache.org/jira/browse/FLINK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404312#comment-17404312
]
xiangqiao commented on FLINK-23886:
-----------------------------------
We downloaded the RocksDB data files of the failed subtask locally and
traversed all records in the *"_timer_state/processing_window-timers"* column family.
We found the following:
1. The value length of the problem records is not 0 (for example, in the
screenshot, "count:5" is a normal record and "count:6" is a problem record).
2. The key of a problem record is 8 bytes shorter than the key of a normal
record.
3. The key format of ListState is (keygroup, key, namespace), while the key
format of a Timer is (keygroup, timestamp, key, namespace). Their lengths
differ by the 8-byte timestamp.
4. The key of a problem record can be deserialized successfully using the
ListState key format.
*5. Therefore, it can be concluded that data from the "window-contents" state
was written into the "_timer_state/processing_window-timers" state.*
{code:java}
@Test
public void testTimerRestore() throws Exception {
    List<ColumnFamilyDescriptor> columnFamilyDescriptors = new ArrayList<>(1 + 3);
    columnFamilyDescriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY));
    columnFamilyDescriptors.add(new ColumnFamilyDescriptor("window-contents".getBytes()));
    columnFamilyDescriptors.add(new ColumnFamilyDescriptor("_timer_state/event_window-timers".getBytes()));
    columnFamilyDescriptors.add(new ColumnFamilyDescriptor("_timer_state/processing_window-timers".getBytes()));
    List<ColumnFamilyHandle> columnFamilyHandles = new ArrayList<>(1 + 3);
    DataInputDeserializer inputView = new DataInputDeserializer();
    // Skip the serialized key-group prefix before deserializing the timer key.
    int numPrefixBytes = 2;
    TimerSerializer<Integer, TimeWindow> timerSerializer =
            new TimerSerializer<>(IntSerializer.INSTANCE, new TimeWindow.Serializer());
    try (RocksDB db = RocksDB.open(
            PredefinedOptions.DEFAULT.createDBOptions(),
            "/Users/xiangqiao/timer_restore/51_500_db",
            columnFamilyDescriptors,
            columnFamilyHandles)) {
        RocksIteratorWrapper rocksIterator =
                RocksDBOperationUtils.getRocksIterator(db, columnFamilyHandles.get(3));
        rocksIterator.seekToFirst();
        long count = 0;
        while (rocksIterator.isValid()) {
            System.out.println("-----count:" + count++ + "-----");
            byte[] keyBytes = rocksIterator.key();
            byte[] valueBytes = rocksIterator.value();
            System.out.println("key len:" + keyBytes.length + "," + Arrays.toString(keyBytes));
            System.out.println("value len:" + valueBytes.length + "," + Arrays.toString(valueBytes));
            inputView.setBuffer(keyBytes, numPrefixBytes, keyBytes.length - numPrefixBytes);
            try {
                System.out.println(timerSerializer.deserialize(inputView));
            } catch (Exception e) {
                e.printStackTrace();
            }
            // Advance only after the current entry has been read; calling next()
            // before key()/value() would skip the first record.
            rocksIterator.next();
        }
    }
}
{code}
!image-2021-08-25-17-07-38-327.png!
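The 8-byte difference between the two key layouts (points 2 and 3 above) can be sketched with plain ByteBuffer arithmetic. This is only an illustration with hypothetical fixed-width fields, not Flink's actual serialization format (Flink's serializers may use variable-length encodings); the class and method names are invented for the sketch:
{code:java}
import java.nio.ByteBuffer;

// Illustrative sketch of the two RocksDB key layouts described above.
// Fixed-width fields (short keygroup, int key, long namespace/timestamp)
// are assumed purely to make the 8-byte difference visible.
public class KeyLayoutSketch {

    // ListState key layout: (keygroup, key, namespace)
    static byte[] listStateKey(short keyGroup, int key, long namespace) {
        return ByteBuffer.allocate(2 + 4 + 8)
                .putShort(keyGroup)
                .putInt(key)
                .putLong(namespace)
                .array();
    }

    // Timer key layout: (keygroup, timestamp, key, namespace)
    static byte[] timerKey(short keyGroup, long timestamp, int key, long namespace) {
        return ByteBuffer.allocate(2 + 8 + 4 + 8)
                .putShort(keyGroup)
                .putLong(timestamp)
                .putInt(key)
                .putLong(namespace)
                .array();
    }

    public static void main(String[] args) {
        byte[] listKey = listStateKey((short) 1, 42, 1000L);
        byte[] timer = timerKey((short) 1, 123456789L, 42, 1000L);
        // The timer key is exactly Long.BYTES (8) longer: the timestamp field.
        System.out.println(timer.length - listKey.length); // prints 8
    }
}
{code}
This also shows why a ListState key deserializes "successfully" under the shorter format: every field after the keygroup prefix simply shifts by 8 bytes, so a timer-format reader hits EOF 8 bytes early, matching the EOFException in the stack trace below.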
> An exception is thrown out when recover job timers from checkpoint file
> -----------------------------------------------------------------------
>
> Key: FLINK-23886
> URL: https://issues.apache.org/jira/browse/FLINK-23886
> Project: Flink
> Issue Type: Bug
> Components: Runtime / State Backends
> Reporter: JING ZHANG
> Priority: Major
> Attachments: image-2021-08-25-16-38-04-023.png,
> image-2021-08-25-16-38-12-308.png, image-2021-08-25-17-06-29-806.png,
> image-2021-08-25-17-07-38-327.png
>
>
> A user reported the bug on the [mailing
> list|http://mail-archives.apache.org/mod_mbox/flink-user/202108.mbox/%3ccakmsf43j14nkjmgjuy4dh5qn2vbjtw4tfh4pmmuyvcvfhgf...@mail.gmail.com%3E].
> I paste the content here.
> Setup Specifics:
> Version: 1.6.2
> RocksDB Map State
> Timers stored in rocksdb
>
> When we have this job running for long periods of time like > 30 days, if
> for some reason the job restarts, we encounter "Error while deserializing the
> element". Is this a known issue fixed in later versions? I see some changes
> to code for FLINK-10175, but we don't use any queryable state
>
> Below is the stack trace
>
> org.apache.flink.util.FlinkRuntimeException: Error while deserializing the
> element.
> at
> org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.deserializeElement(RocksDBCachingPriorityQueueSet.java:389)
> at
> org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.peek(RocksDBCachingPriorityQueueSet.java:146)
> at
> org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.peek(RocksDBCachingPriorityQueueSet.java:56)
> at
> org.apache.flink.runtime.state.heap.KeyGroupPartitionedPriorityQueue$InternalPriorityQueueComparator.comparePriority(KeyGroupPartitionedPriorityQueue.java:274)
> at
> org.apache.flink.runtime.state.heap.KeyGroupPartitionedPriorityQueue$InternalPriorityQueueComparator.comparePriority(KeyGroupPartitionedPriorityQueue.java:261)
> at
> org.apache.flink.runtime.state.heap.HeapPriorityQueue.isElementPriorityLessThen(HeapPriorityQueue.java:164)
> at
> org.apache.flink.runtime.state.heap.HeapPriorityQueue.siftUp(HeapPriorityQueue.java:121)
> at
> org.apache.flink.runtime.state.heap.HeapPriorityQueue.addInternal(HeapPriorityQueue.java:85)
> at
> org.apache.flink.runtime.state.heap.AbstractHeapPriorityQueue.add(AbstractHeapPriorityQueue.java:73)
> at
> org.apache.flink.runtime.state.heap.KeyGroupPartitionedPriorityQueue.<init>(KeyGroupPartitionedPriorityQueue.java:89)
> at
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBPriorityQueueSetFactory.create(RocksDBKeyedStateBackend.java:2792)
> at
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.create(RocksDBKeyedStateBackend.java:450)
> at
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.createTimerPriorityQueue(InternalTimeServiceManager.java:121)
> at
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.registerOrGetTimerService(InternalTimeServiceManager.java:106)
> at
> org.apache.flink.streaming.api.operators.InternalTimeServiceManager.getInternalTimerService(InternalTimeServiceManager.java:87)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.getInternalTimerService(AbstractStreamOperator.java:764)
> at
> org.apache.flink.streaming.api.operators.KeyedProcessOperator.open(KeyedProcessOperator.java:61)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.EOFException
> at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
> at org.apache.flink.types.StringValue.readString(StringValue.java:769)
> at
> org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:69)
> at
> org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:28)
> at
> org.apache.flink.api.java.typeutils.runtime.RowSerializer.deserialize(RowSerializer.java:179)
> at
> org.apache.flink.api.java.typeutils.runtime.RowSerializer.deserialize(RowSerializer.java:46)
> at
> org.apache.flink.streaming.api.operators.TimerSerializer.deserialize(TimerSerializer.java:168)
> at
> org.apache.flink.streaming.api.operators.TimerSerializer.deserialize(TimerSerializer.java:45)
> at
> org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.deserializeElement(RocksDBCachingPriorityQueueSet.java:387)
> ... 20 more
--
This message was sent by Atlassian Jira
(v8.3.4#803005)