[ 
https://issues.apache.org/jira/browse/FLINK-23509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Paul updated FLINK-23509:
--------------------------------
    Affects Version/s:     (was: 1.13.0)
                       1.14.0
                       1.13.1
                       1.12.5

> FlinkKafkaInternalProducer overrides static final ProducerIdAndEpoch#NONE 
> during transaction recovery (fails)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23509
>                 URL: https://issues.apache.org/jira/browse/FLINK-23509
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Kafka
>    Affects Versions: 1.14.0, 1.13.1, 1.12.5
>            Reporter: Matthias Schwalbe
>            Assignee: Fabian Paul
>            Priority: Major
>         Attachments: 2021-07-26_16-47-48.png
>
>
> When recovering Kafka transactions from a snapshot, 
> FlinkKafkaInternalProducer overrides static final ProducerIdAndEpoch#NONE 
> here:
> [FlinkKafkaInternalProducer#resumeTransaction|https://github.com/apache/flink/blob/f06faf13930f2e8acccf1e04e2c250b85bdbf48e/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/internals/FlinkKafkaInternalProducer.java#L229]
> and consequently TransactionManager initializes transactions as new 
> transactions instead of recovered ones. Here:
> [TransactionManager#initializeTransactions|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L332]
> TransactionManager log (edited for readability):
> {quote}{{[Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.KafkaProducer - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Overriding the default enable.idempotence to 
> true since transactional.id is specified.
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.KafkaProducer - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Instantiated a transactional producer.
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.KafkaProducer - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Overriding the default retries config to the 
> recommended value of 2147483647 since the idempotent producer is enabled.
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.KafkaProducer - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Overriding the default acks to all since 
> idempotence is enabled.
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] DEBUG 
> org.apache.kafka.clients.producer.internals.TransactionManager - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Transition from state UNINITIALIZED to 
> INITIALIZING
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.internals.TransactionManager - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Invoking InitProducerId for the first time in 
> order to acquire a producer ID
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] DEBUG 
> org.apache.kafka.clients.producer.internals.TransactionManager - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Enqueuing transactional request 
> InitProducerIdRequestData(transactionalId='Sink: 
> trxRollupKafkaSink-...8b6-2', transactionTimeoutMs=60000, producerId=1545118, 
> producerEpoch=17)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> TRACE org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Request 
> InitProducerIdRequestData(transactionalId='Sink: 
> trxRollupKafkaSink-...8b6-2', transactionTimeoutMs=60000, producerId=1545118, 
> producerEpoch=17) dequeued for sending
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> DEBUG org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Enqueuing transactional 
> request FindCoordinatorRequestData(key='Sink: trxRollupKafkaSink-...8b6-2', 
> keyType=1)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> DEBUG org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Enqueuing transactional 
> request InitProducerIdRequestData(transactionalId='Sink: 
> trxRollupKafkaSink-...8b6-2', transactionTimeoutMs=60000, producerId=1545118, 
> producerEpoch=17)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> TRACE org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Request 
> FindCoordinatorRequestData(key='Sink: trxRollupKafkaSink-...8b6-2', 
> keyType=1) dequeued for sending
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> TRACE org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Received transactional 
> response FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, 
> errorMessage='NONE', nodeId=3, host='ulxxtkafbrk03.adgr.net', port=9093) for 
> request FindCoordinatorRequestData(key='Sink: trxRollupKafkaSink-...8b6-2', 
> keyType=1)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> INFO org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Discovered transaction 
> coordinator ulxxtkafbrk03.adgr.net:9093 (id: 3 rack: null)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> TRACE org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Request 
> InitProducerIdRequestData(transactionalId='Sink: 
> trxRollupKafkaSink-...8b6-2', transactionTimeoutMs=60000, producerId=1545118, 
> producerEpoch=17) dequeued for sending
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> TRACE org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Received transactional 
> response InitProducerIdResponseData(throttleTimeMs=0, errorCode=47, 
> producerId=-1, producerEpoch=-1) for request 
> InitProducerIdRequestData(transactionalId='Sink: 
> trxRollupKafkaSink-...8b6-2', transactionTimeoutMs=60000, producerId=1545118, 
> producerEpoch=17)
>  [kafka-producer-network-thread | producer-Sink: trxRollupKafkaSink-...8b6-2] 
> DEBUG org.apache.kafka.clients.producer.internals.TransactionManager - 
> [Producer clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, 
> transactionalId=Sink: trxRollupKafkaSink-...8b6-2] Transition from state 
> INITIALIZING to error state FATAL_ERROR
>  [Sink: trxRollupKafkaSink (1/1)#3|#3] INFO 
> org.apache.kafka.clients.producer.KafkaProducer - [Producer 
> clientId=producer-Sink: trxRollupKafkaSink-...8b6-2, transactionalId=Sink: 
> trxRollupKafkaSink-...8b6-2] Closing the Kafka producer with timeoutMillis = 
> 0 ms.
>  org.apache.kafka.common.KafkaException: Unexpected error in 
> InitProducerIdResponse; Producer attempted an operation with an old epoch. 
> Either there is a newer producer with the same transactionalId, or the 
> producer's transaction has been expired by the broker.
>  at 
> org.apache.kafka.clients.producer.internals.TransactionManager$InitProducerIdHandler.handleResponse(TransactionManager.java:1352)
>  at 
> org.apache.kafka.clients.producer.internals.TransactionManager$TxnRequestHandler.onComplete(TransactionManager.java:1260)
>  at 
> org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:109)
>  at 
> org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:572)
>  at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:564)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.maybeSendAndPollTransactionalRequest(Sender.java:414)
>  at 
> org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:312)
>  at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
>  at java.lang.Thread.run(Thread.java:748)
>  }}
> {quote}
>  Notice here "Invoking InitProducerId for the first time in order to acquire 
> a producer ID" indicates a request for a new transaction (-1, -1) but below 
> we see instead: "Enqueuing transactional request 
> InitProducerIdRequestData(transactionalId='Sink: ...', ..., 
> producerId=1545118, producerEpoch=17)" because of changed 
> ProducerIdAndEpoch#NONE
>  
> TransactionManager#initializeTransactions variables:
>   !2021-07-26_16-47-48.png!
> Notice the values above which should be -1, -1.
> Stack trace of TransactionManager#initializeTransactions:
> {quote}initializeTransactions:314, TransactionManager 
> (org.apache.kafka.clients.producer.internals)
>  initializeTransactions:310, TransactionManager 
> (org.apache.kafka.clients.producer.internals)
>  initTransactions:591, KafkaProducer (org.apache.kafka.clients.producer)
>  initTransactions:88, FlinkKafkaInternalProducer 
> (org.apache.flink.streaming.connectors.kafka.internals)
>  recoverAndAbort:1060, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  recoverAndAbort:99, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  initializeState:371, TwoPhaseCommitSinkFunction 
> (org.apache.flink.streaming.api.functions.sink)
>  initializeState:1195, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  tryRestoreFunction:189, StreamingFunctionUtils 
> (org.apache.flink.streaming.util.functions)
>  restoreFunctionState:171, StreamingFunctionUtils 
> (org.apache.flink.streaming.util.functions)
>  initializeState:96, AbstractUdfStreamOperator 
> (org.apache.flink.streaming.api.operators)
>  initializeOperatorState:118, StreamOperatorStateHandler 
> (org.apache.flink.streaming.api.operators)
>  initializeState:290, AbstractStreamOperator 
> (org.apache.flink.streaming.api.operators)
>  initializeStateAndOpenOperators:436, OperatorChain 
> (org.apache.flink.streaming.runtime.tasks)
>  restoreGates:574, StreamTask (org.apache.flink.streaming.runtime.tasks)
>  call:-1, 412600778 
> (org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$745)
>  call:55, StreamTaskActionExecutor$1 
> (org.apache.flink.streaming.runtime.tasks)
>  restore:554, StreamTask (org.apache.flink.streaming.runtime.tasks)
>  doRun:756, Task (org.apache.flink.runtime.taskmanager)
>  run:563, Task (org.apache.flink.runtime.taskmanager)
>  run:748, Thread (java.lang)
> {quote}
>  
> Stack trace of FlinkKafkaInternalProducer#resumeTransaction when the values 
> are overridden:
> {quote}resumeTransaction:204, FlinkKafkaInternalProducer 
> (org.apache.flink.streaming.connectors.kafka.internals)
>  resumeTransaction:196, KafkaProducerJobSink$$anon$1$$anon$2 
> (ch.viseca.flink.connectors.kafka.sinks)
>  recoverAndCommit:1029, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  recoverAndCommit:99, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  recoverAndCommitInternal:414, TwoPhaseCommitSinkFunction 
> (org.apache.flink.streaming.api.functions.sink)
>  initializeState:364, TwoPhaseCommitSinkFunction 
> (org.apache.flink.streaming.api.functions.sink)
>  initializeState:1195, FlinkKafkaProducer 
> (org.apache.flink.streaming.connectors.kafka)
>  tryRestoreFunction:189, StreamingFunctionUtils 
> (org.apache.flink.streaming.util.functions)
>  restoreFunctionState:171, StreamingFunctionUtils 
> (org.apache.flink.streaming.util.functions)
>  initializeState:96, AbstractUdfStreamOperator 
> (org.apache.flink.streaming.api.operators)
>  initializeOperatorState:118, StreamOperatorStateHandler 
> (org.apache.flink.streaming.api.operators)
>  initializeState:290, AbstractStreamOperator 
> (org.apache.flink.streaming.api.operators)
>  initializeStateAndOpenOperators:436, OperatorChain 
> (org.apache.flink.streaming.runtime.tasks)
>  restoreGates:574, StreamTask (org.apache.flink.streaming.runtime.tasks)
>  call:-1, 412600778 
> (org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$745)
>  call:55, StreamTaskActionExecutor$1 
> (org.apache.flink.streaming.runtime.tasks)
>  restore:554, StreamTask (org.apache.flink.streaming.runtime.tasks)
>  doRun:756, Task (org.apache.flink.runtime.taskmanager)
>  run:563, Task (org.apache.flink.runtime.taskmanager)
>  run:748, Thread (java.lang)
> {quote}
>  
> Background:
>  * we recently upgraded from Flink 1.8.0 to 1.13.0
>  * FlinkKafkaInternalProducer#resumeTransaction has not changed between these 
> versions, however
>  * in Flink 1.8.0 we never observed any resumable transaction as part of a 
> checkpoint
>  * we could not determine what actually made the change that causes the 
> failure, however:
>  * it would probably be much saver to instead of assigning new values to an 
> arbitrary ProducerIdAndEpoch held by TransactionManager to directly assign a 
> fresh ProducerIdAndEpoch and thus avoid overriding ProducerIdAndEpoch#NONE
> Sample workaround (scala):
> {quote}val sink = new FlinkKafkaProducer[T](val sink = new 
> FlinkKafkaProducer[T](  defaultTopic,  schema,  getProperties,  getSemantic,  
> getProducerPoolSize) {
>    override protected def createProducer: 
> FlinkKafkaInternalProducer[Array[Byte], Array[Byte]] =
> Unknown macro: \{ new FlinkKafkaInternalProducer[Array[Byte], 
> Array[Byte]](this.producerConfig) Unknown macro}
>       override def resumeTransaction(producerId: Long, epoch: Short): Unit =
>  Unknown macro: \{        val transactionManager = 
> FlinkKafkaInternalProducer.getField(kafkaProducer, "transactionManager")      
>   transactionManager.synchronized Unknown macro}
>          super.resumeTransaction(producerId, epoch)      }    }  }}
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to