[jira] [Commented] (FLINK-6006) Kafka Consumer can lose state if queried partition list is incomplete on restore

ASF GitHub Bot (JIRA) Tue, 14 Mar 2017 02:33:01 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923872#comment-15923872
 ]


ASF GitHub Bot commented on FLINK-6006:
---------------------------------------

Github user tzulitai commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3505#discussion_r105860279
  
    --- Diff: flink-connectors/flink-connector-kafka-base/src/main/java/org/apache/flink/streaming/connectors/kafka/internals/AbstractFetcher.java ---
    @@ -194,14 +194,29 @@ protected AbstractFetcher(
     
        /**
         * Restores the partition offsets.
    +    * The partitions in the provided map of restored partitions to offsets must completely match
    +    * the fetcher's subscribed partitions.
         * 
    -    * @param snapshotState The offsets for the partitions 
    +    * @param restoredOffsets The restored offsets for the partitions
    +    *
    +    * @throws IllegalStateException if the partitions in the provided restored offsets map
    +    * cannot completely match the fetcher's subscribed partitions.
         */
    -   public void restoreOffsets(Map<KafkaTopicPartition, Long> snapshotState) {
    -           for (KafkaTopicPartitionState<?> partition : allPartitions) {
    -                   Long offset = snapshotState.get(partition.getKafkaTopicPartition());
    -                   if (offset != null) {
    -                           partition.setOffset(offset);
    +   public void restoreOffsets(Map<KafkaTopicPartition, Long> restoredOffsets) {
    +           if (restoredOffsets.size() != allPartitions.length) {
    +                   throw new IllegalStateException(
    +                           "The fetcher was restored with partition offsets that do not " +
    +                                   "match with the subscribed partitions: " + restoredOffsets);
    --- End diff --
    
    This would not happen with the changes of this PR.
    
    In `open()`, I've set the `subscribedPartitions` to be exactly the same as 
the restored partition states. There is no filtering anymore. 
`allPartitions` here is basically the same list, just in state holder 
form.
    
    The condition checks exist simply because "setting the fetcher's 
subscribed partitions" and "restoring start offsets" are two separate calls (the 
former is passed in through the fetcher's constructor, while the latter is 
provided through the `restoreOffsets` method). I added these checks just to 
make the fetcher code more self-contained. These exceptions should actually 
never occur.
    
    I agree this might be a bit confusing for the code reader. In the recent 
refactorings in `master`, the setup of the fetcher's subscribed partitions and 
start offsets (regardless of whether it is a restore or a fresh start) is more 
atomic and less confusing in this respect.
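
To illustrate the fail-fast check described in the comment above, here is a minimal Java sketch. It is not the actual `AbstractFetcher` code: `FetcherSketch`, the plain string partition names, and `offsetOf` are hypothetical simplifications, and the sketch compares the full key sets rather than only the map size shown in the diff excerpt.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the fetcher. Partitions are plain
// strings such as "topic-0" instead of KafkaTopicPartition objects.
class FetcherSketch {
    // Subscribed partitions mapped to their start offsets (null = not yet set).
    private final Map<String, Long> subscribed = new LinkedHashMap<>();

    // Step 1: the subscribed partitions are fixed at construction time.
    FetcherSketch(Iterable<String> subscribedPartitions) {
        for (String partition : subscribedPartitions) {
            subscribed.put(partition, null);
        }
    }

    // Step 2: a separate call restores the offsets. Because the two steps are
    // separate, the method verifies that both refer to the same partitions.
    void restoreOffsets(Map<String, Long> restoredOffsets) {
        if (!restoredOffsets.keySet().equals(subscribed.keySet())) {
            throw new IllegalStateException(
                "The fetcher was restored with partition offsets that do not "
                    + "match the subscribed partitions: " + restoredOffsets);
        }
        subscribed.putAll(restoredOffsets);
    }

    Long offsetOf(String partition) {
        return subscribed.get(partition);
    }

    public static void main(String[] args) {
        FetcherSketch fetcher = new FetcherSketch(Arrays.asList("topic-0", "topic-1"));

        Map<String, Long> restored = new HashMap<>();
        restored.put("topic-0", 100L);
        restored.put("topic-1", 250L);
        fetcher.restoreOffsets(restored); // matches the subscription, so it succeeds

        restored.remove("topic-1");
        try {
            fetcher.restoreOffsets(restored); // incomplete map, so it is rejected
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

If the caller always derives the subscription from the restored state, as the comment says `open()` now does, the exception path is unreachable and the check serves only as a guard against future misuse.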


> Kafka Consumer can lose state if queried partition list is incomplete on 
> restore
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-6006
>                 URL: https://issues.apache.org/jira/browse/FLINK-6006
>             Project: Flink
>          Issue Type: Bug
>          Components: Kafka Connector, Streaming Connectors
>            Reporter: Tzu-Li (Gordon) Tai
>            Assignee: Tzu-Li (Gordon) Tai
>            Priority: Blocker
>             Fix For: 1.1.5, 1.2.1
>
>
> In 1.1.x and 1.2.x, the FlinkKafkaConsumer queries the partition list 
> on restore. Then, only the restored state of partitions that exist in the 
> queried list is used to initialize the fetcher's state holders.
> If for any reason the returned partition list is incomplete (i.e. missing 
> partitions that existed before, perhaps due to temporary ZK / broker 
> downtime), then the state of the missing partitions is dropped and can no 
> longer be recovered.
> In 1.3-SNAPSHOT, this is fixed by changes in FLINK-4280, so only 1.1 and 1.2 
> are affected.
> We can backport some of the behavioural changes there to 1.1 and 1.2. 
> Generally, we should not depend on the current partition list in Kafka when 
> restoring, but just restore all previous state into the fetcher's state 
> holders. 
> This would therefore also require some checking on how the consumer threads / 
> Kafka clients behave when their assigned partitions cannot be reached.
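
The difference between the affected behaviour and the proposed one can be sketched in a few lines of Java. This is an illustration under simplifying assumptions, not the connector's code: `RestoreSketch`, `restoreFiltered`, and `restoreAll` are hypothetical names, and partitions are represented as plain strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RestoreSketch {
    // Affected behaviour: keep only restored entries whose partition also
    // appears in the freshly queried partition list.
    static Map<String, Long> restoreFiltered(Map<String, Long> restored, Set<String> queried) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> entry : restored.entrySet()) {
            if (queried.contains(entry.getKey())) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    // Proposed behaviour: restore every entry, independent of what Kafka
    // currently reports.
    static Map<String, Long> restoreAll(Map<String, Long> restored) {
        return new HashMap<>(restored);
    }

    public static void main(String[] args) {
        Map<String, Long> restored = new HashMap<>();
        restored.put("topic-0", 100L);
        restored.put("topic-1", 250L);

        // Simulate an incomplete query result: topic-1 is temporarily missing.
        Set<String> queried = new HashSet<>(Arrays.asList("topic-0"));

        // The filtered restore silently drops the offset for topic-1.
        System.out.println(restoreFiltered(restored, queried).containsKey("topic-1")); // false
        // The full restore keeps it.
        System.out.println(restoreAll(restored).containsKey("topic-1")); // true
    }
}
```

Once the filtered map becomes the consumer's state and is checkpointed again, the offset for the missing partition is gone for good, which is the loss this issue describes.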



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
