Yordan Pavlov created FLINK-28802:
-------------------------------------
Summary: EOFException when recovering from a checkpoint
Key: FLINK-28802
URL: https://issues.apache.org/jira/browse/FLINK-28802
Project: Flink
Issue Type: Bug
Reporter: Yordan Pavlov
Flink version: 1.14.2
When recovering from a Savepoint, the TaskManager would throw an EOFException
at this point:
[https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java#L1681]
It looks like the transactional id is missing on deserialization. As an
interesting observation, before trying to recover from the mentioned checkpoint
the job failed with:
{code:java}
Aug 2, 2022 @ 16:50:40.107 Caused by:
org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received
an out of order sequence number.
Aug 2, 2022 @ 16:50:40.106 2022-08-02 13:50:40.106
[flink-akka.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Time correct sink
UTXO Balances -> Sink Save to Kafka UTXO Balances bch-balances-v4 (1/1)
(d2847bd31137561e3ba075a7cf70ee7c) switched from RUNNING to FAILED on
10.42.238.71:37353-fc3e48 @
bch-balances-v4-flink-taskmanager-0.bch-balances-v4-flink-taskmanager.flink.svc.cluster.local
(dataPort=42553).
Aug 2, 2022 @ 16:50:40.106 org.apache.flink.util.FlinkRuntimeException:
Failed to send data to Kafka bch-balances-v4-0@-1 with
FlinkKafkaInternalProducer{transactionalId='bch-balances-v4-0-690507',
inTransaction=true, closed=false} {code}
Could it be that Flink actually failed to construct the checkpoint but still
marked it as completed. What would be the way to check this? Is there something
like a checksum byte that is checked on recovery?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)