[jira] [Commented] (FLINK-8132) FlinkKafkaProducer011 can commit incorrect transaction during recovery

ASF GitHub Bot (JIRA) Wed, 22 Nov 2017 08:28:29 -0800

    [ https://issues.apache.org/jira/browse/FLINK-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262878#comment-16262878 ]


ASF GitHub Bot commented on FLINK-8132:
---------------------------------------

GitHub user pnowojski opened a pull request:

    https://github.com/apache/flink/pull/5053

    [FLINK-8132][kafka] Re-initialize transactional KafkaProducer on each checkpoint

    Previously faulty scenario with a producer pool of size 2:

    1. Transaction 1 (txn1) is started with producerA; record 42 is written.
    2. Checkpoint 1 is triggered: txn1 is pre-committed and txn2 is started
       with producerB; record 43 is written.
    3. Checkpoint 1 completes: txn1 is committed and producerA is returned to
       the pool.
    4. Checkpoint 2 is triggered: txn2 is pre-committed and txn3 is started
       with producerA; record 44 is written.
    5. The job crashes.
    6. The job recovers to checkpoint 1: txn1 from producerA is found in
       "pendingCommitTransactions", so recoverAndCommit(txn1) is attempted.
    7. Unfortunately, txn1 and txn3 come from the same producer and are
       identical from the Kafka broker's perspective, so txn3 is committed.

    The result is that both records 42 and 44 are committed.

    With this fix, after re-initialization txn3 will have different
    producerId/epoch counters than txn1.
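
The scenario above can be reproduced in a toy model. The following is a minimal sketch, not the connector's code: `TxnReuseSketch`, `ToyBroker`, `TxnHandle`, and `run` are hypothetical names, and the broker is assumed to identify a producer's transaction only by its transactional id plus the (producerId, epoch) pair. Here `init` stands in for re-initializing the producer, and aborting txn2 and txn3 on recovery is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name here is hypothetical, not a Flink or Kafka API.
final class TxnReuseSketch {

    /** What the sink keeps in checkpointed state for a pre-committed transaction. */
    static final class TxnHandle {
        final String transactionalId;
        final long producerId;
        final int epoch;

        TxnHandle(String transactionalId, long producerId, int epoch) {
            this.transactionalId = transactionalId;
            this.producerId = producerId;
            this.epoch = epoch;
        }

        boolean sameIdentity(TxnHandle other) {
            return other != null && producerId == other.producerId && epoch == other.epoch;
        }
    }

    /** Assumed broker behaviour: one open transaction per transactional id. */
    static final class ToyBroker {
        private long nextProducerId = 1000L;
        private final Map<String, TxnHandle> current = new HashMap<>();
        private final Map<String, List<Integer>> open = new HashMap<>();
        final List<Integer> committed = new ArrayList<>();

        // Stand-in for re-initialization: first call assigns a producerId,
        // later calls bump the epoch and drop any open transaction.
        TxnHandle init(String transactionalId) {
            TxnHandle old = current.get(transactionalId);
            TxnHandle fresh = (old == null)
                    ? new TxnHandle(transactionalId, nextProducerId++, 0)
                    : new TxnHandle(transactionalId, old.producerId, old.epoch + 1);
            current.put(transactionalId, fresh);
            open.remove(transactionalId);
            return fresh;
        }

        void write(TxnHandle handle, int record) {
            open.computeIfAbsent(handle.transactionalId, k -> new ArrayList<>()).add(record);
        }

        // Commits whatever is open for the id if the handle's identity matches;
        // returns false for a stale handle.
        boolean commit(TxnHandle handle) {
            if (!handle.sameIdentity(current.get(handle.transactionalId))) {
                return false;
            }
            List<Integer> records = open.remove(handle.transactionalId);
            if (records != null) {
                committed.addAll(records);
            }
            return true;
        }
    }

    /** Replays steps 1-7 and returns the records the toy broker ends up committing. */
    static List<Integer> run(boolean reinitializePerTransaction) {
        ToyBroker broker = new ToyBroker();
        TxnHandle producerA = broker.init("producerA");
        TxnHandle producerB = broker.init("producerB");

        TxnHandle txn1 = producerA;                       // step 1
        broker.write(txn1, 42);

        TxnHandle checkpoint1State = txn1;                // step 2: txn1 pre-committed
        TxnHandle txn2 = reinitializePerTransaction ? broker.init("producerB") : producerB;
        broker.write(txn2, 43);

        broker.commit(txn1);                              // step 3

        TxnHandle txn3 = reinitializePerTransaction       // step 4: producerA reused
                ? broker.init("producerA") : producerA;
        broker.write(txn3, 44);

        broker.commit(checkpoint1State);                  // steps 5-7: recover, re-commit txn1
        return broker.committed;
    }

    public static void main(String[] args) {
        System.out.println("without re-initialization: " + run(false));
        System.out.println("with re-initialization:    " + run(true));
    }
}
```

Under these assumptions, `run(false)` returns `[42, 44]` because the stale txn1 handle is indistinguishable from txn3, while `run(true)` returns `[42]` because the bumped epoch makes the stale handle fail the identity check.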
    
    ## Verifying this change
    
    This PR adds a test that failed before the fix, to guard against future
    regressions.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (yes / **no**)
      - The public API, i.e., is any changed class annotated with
        `@Public(Evolving)`: (yes / **no**)
      - The serializers: (yes / **no** / don't know)
      - The runtime per-record code paths (performance sensitive):
        (yes / **no** / don't know)
      - Anything that affects deployment or recovery: JobManager (and its
        components), Checkpointing, Yarn/Mesos, ZooKeeper:
        (yes / **no** / don't know)
      - The S3 file system connector: (yes / **no** / don't know)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (yes / **no**)
      - If yes, how is the feature documented?
        (**not applicable** / docs / JavaDocs / not documented)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pnowojski/flink f8132

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5053
    
----
commit 6f0f56ca25a12da513253d738d14d8516a808bd8
Author: Piotr Nowojski <[email protected]>
Date:   2017-11-22T10:37:48Z

    [hotfix][kafka] Throw FlinkKafkaProducer011Exception with error codes instead of generic Exception

commit b79a372d589df54f505857ed78e120d69d0ad50a
Author: Piotr Nowojski <[email protected]>
Date:   2017-11-22T14:53:08Z

    [FLINK-8132][kafka] Re-initialize transactional KafkaProducer on each checkpoint

    Previously faulty scenario with a producer pool of size 2:

    1. Transaction 1 (txn1) is started with producerA; record 42 is written.
    2. Checkpoint 1 is triggered: txn1 is pre-committed and txn2 is started
       with producerB; record 43 is written.
    3. Checkpoint 1 completes: txn1 is committed and producerA is returned to
       the pool.
    4. Checkpoint 2 is triggered: txn2 is pre-committed and txn3 is started
       with producerA; record 44 is written.
    5. The job crashes.
    6. The job recovers to checkpoint 1: txn1 from producerA is found in
       "pendingCommitTransactions", so recoverAndCommit(txn1) is attempted.
    7. Unfortunately, txn1 and txn3 come from the same producer and are
       identical from the Kafka broker's perspective, so txn3 is committed.

    The result is that both records 42 and 44 are committed.

    With this fix, after re-initialization txn3 will have different
    producerId/epoch counters than txn1.

commit a5938795c5ae0dd7022d07ee90620fb6a8ce14a9
Author: Piotr Nowojski <[email protected]>
Date:   2017-11-22T14:55:20Z

    [hotfix][kafka] Remove unused method in kafka tests

----


> FlinkKafkaProducer011 can commit incorrect transaction during recovery
> ----------------------------------------------------------------------
>
>                 Key: FLINK-8132
>                 URL: https://issues.apache.org/jira/browse/FLINK-8132
>             Project: Flink
>          Issue Type: Bug
>          Components: Kafka Connector
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> Faulty scenario with a producer pool of size 2:
> 1. Transaction 1 (txn1) is started with producerA; record 42 is written.
> 2. Checkpoint 1 is triggered: txn1 is pre-committed and txn2 is started
>    with producerB; record 43 is written.
> 3. Checkpoint 1 completes: txn1 is committed and producerA is returned to
>    the pool.
> 4. Checkpoint 2 is triggered: txn2 is pre-committed and txn3 is started
>    with producerA; record 44 is written.
> 5. The job crashes.
> 6. The job recovers to checkpoint 1: txn1 from producerA is found in
>    "pendingCommitTransactions", so recoverAndCommit(txn1) is attempted.
> 7. Unfortunately, txn1 and txn3 come from the same producer and are
>    identical from the Kafka broker's perspective, so txn3 is committed.
> The result is that both records 42 and 44 are committed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
