[ https://issues.apache.org/jira/browse/KAFKA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson updated KAFKA-7601:
-----------------------------------
    Description: 
During an upgrade of the message format, there is a short time during which the 
configured message format version is not consistent across all replicas of a 
partition. As long as all brokers are on the same binary version, this 
typically does not cause any problems. Followers will take whatever message 
format is used by the leader. However, it is possible for leadership to change 
several times between brokers which support the new format and those which 
support the old format. This can cause the version used in the log to flap 
between the different formats until the upgrade is complete. 

For example, suppose broker 1 has been updated to use v2 and broker 2 is still 
on v1. When broker 1 is the leader, all new messages will be written in the v2 
format. When broker 2 is leader, v1 will be used. If there is any instability 
in the cluster or if completion of the update is delayed, then the log will be 
seen to switch back and forth between v1 and v2. Once the update is complete 
and broker 2 begins using v2, the message format will stabilize and 
everything will generally be ok.
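
For reference, the version in play here is the log.message.format.version 
broker/topic config. The mid-upgrade state described above looks roughly like 
this (illustrative values; 0.11.0 corresponds to the v2 format, 0.10.2 to v1):

    # broker 1 server.properties (already updated)
    log.message.format.version=0.11.0   # writes v2 batches when leader

    # broker 2 server.properties (not yet updated)
    log.message.format.version=0.10.2   # writes v1 batches when leader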

Downgrades of the message format are problematic, even if they are just 
temporary. There are basically two issues:

1. We use the configured message format version to tell whether down-conversion 
is needed (see the first sketch after this list). We assume that this is always 
the maximum version used in the log, but that assumption fails in the case of a 
downgrade. In the worst case, old clients will see the new format and likely 
fail.

2. The logic we use for finding the truncation offset during the become-follower 
transition does not handle flapping between message formats (see the second 
sketch below). When the new format is used by the leader, the epoch cache is 
updated correctly. When the old format is in use, the epoch cache won't be 
updated. This can lead to incorrect results for OffsetsForLeaderEpoch queries.
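
To make the first issue concrete, here is a minimal sketch (hypothetical names 
and simplified Int versions, not the actual broker code) of the kind of check 
involved:

    // Hypothetical sketch of the down-conversion decision. The broker
    // compares the fetcher's supported format against the *configured*
    // format version, assuming the log contains nothing newer.
    def needsDownConversion(configuredFormatVersion: Int,
                            fetcherFormatVersion: Int): Boolean =
      fetcherFormatVersion < configuredFormatVersion

    // After a temporary downgrade (configured = 1), the log may still
    // hold v2 batches, yet this check reports that no conversion is
    // needed for a v1-only client, so the client receives v2 data.

And the second issue can be sketched like this (again simplified; the real 
broker structure is the leader epoch cache introduced with KIP-101):

    // Epoch entries are only recorded for the v2 format, since v1
    // batches carry no partition leader epoch. Appends made in the old
    // format therefore leave the cache unchanged.
    case class EpochEntry(epoch: Int, startOffset: Long)

    class EpochCacheSketch {
      private var entries = Vector.empty[EpochEntry]

      def onAppend(formatVersion: Int, epoch: Int, baseOffset: Long): Unit =
        if (formatVersion >= 2 && !entries.exists(_.epoch == epoch))
          entries :+= EpochEntry(epoch, baseOffset)

      // An epoch written entirely in the old format never appears here,
      // which is what makes truncation lookups unreliable after flapping.
      def latestEntry: Option[EpochEntry] = entries.lastOption
    }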

We have actually observed the second problem. The scenario went something like 
this. Broker 1 is the leader for epoch 0 and writes some messages to the log 
using the v2 message format. Broker 2 then becomes the leader for epoch 1 and 
writes some messages in the v1 format. On broker 2, the last entry in the epoch 
cache is epoch 0; no entry is written for epoch 1 because it uses the old 
format. When broker 1 becomes a follower, it sends an OffsetsForLeaderEpoch 
query to broker 2 for epoch 0. Since epoch 0 is the last entry in the cache, 
the log end offset is returned. This results in localized log divergence.
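
The lookup that produces the bad answer can be sketched as follows (simplified, 
reusing EpochEntry from the sketch above):

    // Simplified OffsetsForLeaderEpoch semantics: the end offset for a
    // requested epoch is the start offset of the next cached epoch, or
    // the log end offset if the requested epoch is the latest one known.
    def endOffsetFor(requestedEpoch: Int,
                     cache: Vector[EpochEntry],
                     logEndOffset: Long): Long =
      cache.find(_.epoch > requestedEpoch) match {
        case Some(next) => next.startOffset
        case None       => logEndOffset
      }

    // In the scenario above, broker 2's cache ends at epoch 0 because
    // the epoch 1 messages were written in v1. The query for epoch 0
    // therefore returns the log end offset, which wrongly covers the
    // epoch 1 data, so broker 1 does not truncate it.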

There are a few options to fix this problem. At a high level, we can either 
be stricter about preventing downgrades of the message format, or we can add 
additional logic to make downgrades safe.

(Disallow downgrades): As an example of the first approach, the leader could 
always use the maximum of the last version written to the log and the 
configured message format version. 
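
A minimal sketch of that rule, with illustrative names:

    // The leader never appends in an older format than the log already
    // contains: the effective version is the max of the configured
    // value and the last version actually written.
    def effectiveFormatVersion(configured: Int, lastWritten: Int): Int =
      math.max(configured, lastWritten)

This makes a configured downgrade a no-op for any partition that has already 
seen the new format, which is the point of the approach.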

(Allow downgrades): If we want to allow downgrades, then it makes sense to 
invalidate and remove all entries in the epoch cache following the message 
format downgrade. This would basically force us to revert to truncation to the 
high watermark, which is what you'd expect from a downgrade. We would also 
need a solution for the problem of detecting when down-conversion is needed for 
a fetch request. One option I've been thinking about is enforcing the invariant 
that each segment uses only one message format version. Whenever the message 
format changes, we need to roll a new segment. Then we can simply remember 
which format is in use by each segment to tell whether down-conversion is 
needed for a given fetch request.
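
A sketch of that invariant, with hypothetical names and the actual segment 
rolling elided:

    // Roll a new segment whenever the appended batch's format differs
    // from the active segment's, and record each segment's version so a
    // fetch can decide on down-conversion per segment.
    class FormatAwareLog(private var activeSegmentFormat: Int) {
      def maybeRollOnFormatChange(batchFormat: Int): Unit =
        if (batchFormat != activeSegmentFormat) {
          rollSegment()                      // start a fresh segment
          activeSegmentFormat = batchFormat  // remember its format
        }

      // Down-conversion is needed only when the segment being read was
      // written in a newer format than the fetcher supports.
      def needsDownConversion(segmentFormat: Int,
                              fetcherVersion: Int): Boolean =
        fetcherVersion < segmentFormat

      private def rollSegment(): Unit = ()   // elided in this sketch
    }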


> Handle message format downgrades during upgrade of message format version
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-7601
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7601
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Gustafson
>            Priority: Major



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
