[
https://issues.apache.org/jira/browse/FLINK-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366196#comment-17366196
]
Roman Khachatryan commented on FLINK-18464:
-------------------------------------------
Thanks a lot for volunteering [~guxiangfly]!
IIUC, the problem is not in the state itself, but rather in WindowOperator (or
somewhere else - see below).
And the PR tries to provide a quick fix. However, I'm afraid that corrupted
snapshot be written with the fix.
The root cause seems to be in WindowOperator using the wrong state for some
windows. Potential reasons:
1. Some user code explicitly sets non-void namespace on state which uses
VoidNamespace (shouldn't be possible though)
2. Same, done by Flink (e.g. WindowOperator or Trigger Context)
3. VoidSerializer is passed as window serializer to the WindowOperator
4. Window is written to timer state. But because for Heap I see
CopyOnWriteStateTableSnapshot, I think it's not the case: for timers I'd expect
HeapPriorityQueueStateSnapshot)
5. WindowOperator somehow writes Window to its mergingSetsState (which uses
VoidNamespace serializer) - but I don't see how it's possible
6. ...
WDYT? (cc: [~dwysakowicz], [~yunta])
Could you please share your trigger code to exclude Option 1?
As the issue is likely not in the state, I'm going to close FLINK-23036 as a
dulpicate for now.
> ClassCastException during namespace serialization for checkpoint (Heap and
> RocksDB)
> -----------------------------------------------------------------------------------
>
> Key: FLINK-18464
> URL: https://issues.apache.org/jira/browse/FLINK-18464
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / State Backends
> Affects Versions: 1.9.3, 1.13.1
> Reporter: Roman Khachatryan
> Priority: Major
> Labels: pull-request-available
>
> From
> [thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-failed-because-of-TimeWindow-cannot-be-cast-to-VoidNamespace-td36310.html]
> {quote}I'm using flink 1.9 on Mesos and I try to use my own trigger and
> evictor. The state is stored to memory.
> {quote}
>
>
> {code:java}
> input.setParallelism(processParallelism)
> .assignTimestampsAndWatermarks(new UETimeAssigner)
> .keyBy(_.key)
> .window(TumblingEventTimeWindows.of(Time.minutes(20)))
> .trigger(new MyTrigger)
> .evictor(new MyEvictor)
> .process(new MyFunction).setParallelism(aggregateParallelism)
> .addSink(kafkaSink).setParallelism(sinkParallelism)
> .name("kafka-record-sink"){code}
>
>
> {code:java}
> java.lang.Exception: Could not materialize checkpoint 1 for operator
> Window(TumblingEventTimeWindows(1200000), JoinTrigger, JoinEvictor,
> ScalaProcessWindowFunctionWrapper) -> Sink: kafka-record-sink (2/5).
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:1100)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:1042)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException:
> java.lang.ClassCastException:
> org.apache.flink.streaming.api.windowing.windows.TimeWindow cannot be cast to
> org.apache.flink.runtime.state.VoidNamespace
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at
> org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:450)
> at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:47)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:1011)
>
> ... 3 more
> Caused by: java.lang.ClassCastException:
> org.apache.flink.streaming.api.windowing.windows.TimeWindow cannot be cast to
> org.apache.flink.runtime.state.VoidNamespace
> at
> org.apache.flink.runtime.state.VoidNamespaceSerializer.serialize(VoidNamespaceSerializer.java:32)
> at
> org.apache.flink.runtime.state.heap.CopyOnWriteStateMapSnapshot.writeState(CopyOnWriteStateMapSnapshot.java:114)
> at
> org.apache.flink.runtime.state.heap.AbstractStateTableSnapshot.writeStateInKeyGroup(AbstractStateTableSnapshot.java:121)
> at
> org.apache.flink.runtime.state.heap.CopyOnWriteStateTableSnapshot.writeStateInKeyGroup(CopyOnWriteStateTableSnapshot.java:37)
> at
> org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:191)
> at
> org.apache.flink.runtime.state.heap.HeapSnapshotStrategy$1.callInternal(HeapSnapshotStrategy.java:158)
> at
> org.apache.flink.runtime.state.AsyncSnapshotCallable.call(AsyncSnapshotCallable.java:75)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> org.apache.flink.runtime.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:447)
>
> ... 5 more
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)