[ https://issues.apache.org/jira/browse/KAFKA-19996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Calvin Liu updated KAFKA-19996:
-------------------------------
    Description: 
Thanks to [~mjd95] for reporting the following issue.

In Partition.scala, the conditions for ISR expansion and ISR shrink do not
match, which can cause a follower to flap in and out of the ISR when:
 # The partition is under min ISR, which means the HWM can't advance.
 # Replication is slow.
 # The follower is far behind the leader's LEO, but it is at/above the HWM.

The "conflict" between the ISR expand and shrink paths:
 * Shrink ISR, attempted on a schedule, triggers if the follower's LEO has not
recently caught up with the leader's LEO; it takes the last caught-up time
into consideration.

{code:java}
/**
 * Returns true when the replica is considered "caught-up". A replica is
 * considered "caught-up" when its log end offset is equal to the log end
 * offset of the leader OR when the current time minus its last caught-up
 * time is smaller than the max replica lag.
 */
public boolean isCaughtUp(
    long leaderEndOffset,
    long currentTimeMs,
    long replicaMaxLagMs) {
    return leaderEndOffset == logEndOffset()
        || currentTimeMs - lastCaughtUpTimeMs <= replicaMaxLagMs;
} {code}
 
 * Expand ISR, attempted while processing a follower fetch, triggers if the
follower's LEO has caught up to the leader's HWM in this fetch; it only
considers whether the replica has reached the HWM.

{code:scala}
private def isFollowerInSync(followerReplica: Replica): Boolean = {
  leaderLogIfLocal.exists { leaderLog =>
    val followerEndOffset = followerReplica.stateSnapshot.logEndOffset
    followerEndOffset >= leaderLog.highWatermark &&
      leaderEpochStartOffsetOpt.exists(followerEndOffset >= _)
  }
} {code}
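
To make the mismatch concrete, below is a self-contained sketch (not Kafka
source; the two predicates are re-implemented locally, and the class name,
offsets, and timestamps are made up for illustration) showing how a follower
sitting at the stuck HWM but far behind the leader's LEO passes the expand
check on every fetch while the scheduled shrink check keeps marking it out of
sync:

{code:java}
// Standalone illustration of the expand/shrink mismatch (not Kafka source).
// Assumed scenario: the HWM is stuck at 1000 because the partition is under
// min ISR, the leader's LEO is 50000, and the follower fetches slowly at ~1200.
public class IsrFlappingSketch {

    // Mirrors the shrink-side check (ReplicaState#isCaughtUp): LEO match or recent catch-up.
    static boolean isCaughtUp(long leaderEndOffset, long followerEndOffset,
                              long currentTimeMs, long lastCaughtUpTimeMs,
                              long replicaMaxLagMs) {
        return leaderEndOffset == followerEndOffset
            || currentTimeMs - lastCaughtUpTimeMs <= replicaMaxLagMs;
    }

    // Mirrors the expand-side check (Partition#isFollowerInSync): HWM and epoch start only.
    static boolean isFollowerInSync(long followerEndOffset, long highWatermark,
                                    long leaderEpochStartOffset) {
        return followerEndOffset >= highWatermark
            && followerEndOffset >= leaderEpochStartOffset;
    }

    public static void main(String[] args) {
        long leaderEndOffset = 50_000L;    // leader LEO keeps growing
        long highWatermark = 1_000L;       // stuck: partition is under min ISR
        long followerEndOffset = 1_200L;   // at/above the HWM but far behind the LEO
        long replicaMaxLagMs = 30_000L;    // e.g. the replica.lag.time.max.ms default
        long lastCaughtUpTimeMs = 0L;      // follower last matched the leader's LEO long ago
        long currentTimeMs = 120_000L;

        // Fetch path: the follower qualifies for ISR expansion...
        System.out.println("expand? " +
            isFollowerInSync(followerEndOffset, highWatermark, 0L));          // true
        // ...while the scheduled check still wants to shrink it out of the ISR.
        System.out.println("shrink? " +
            !isCaughtUp(leaderEndOffset, followerEndOffset, currentTimeMs,
                        lastCaughtUpTimeMs, replicaMaxLagMs));                // true
        // Both holding at once is what makes the follower flap in and out of the ISR.
    }
}
{code}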


> The conflicts between ExpandIsr and ShrinkIsr
> ---------------------------------------------
>
>                 Key: KAFKA-19996
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19996
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Calvin Liu
>            Priority: Major
>


