[
https://issues.apache.org/jira/browse/ZOOKEEPER-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Kelly updated ZOOKEEPER-3701:
----------------------------------
Affects Version/s: 3.5.6
> Split brain on log disk full
> ----------------------------
>
> Key: ZOOKEEPER-3701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.13, 3.5.6
> Reporter: Ivan Kelly
> Priority: Blocker
>
> We ran into a situation where the cluster ended up with split brain when the
> log disk filled up on a node.
> The ZK cluster (3 nodes) in question was being used as the metadata store for
> Pulsar. There was an outage in the Pulsar cluster, where two of the ZK nodes
> filled up their log disks, causing the cluster to lose quorum. Once we
> rectified the full disk situation and restarted the nodes everything seemed
> to work, but we started getting a lot of log messages about
> UpdateMetadataLoop retrying. UpdateMetadataLoop is used to update BookKeeper
> ledger metadata. If it sees a write conflict, it rereads the znode, checks
> whether the update still needs to happen, applies it and writes. These retries
> were flooding the log on a subset of the brokers. It turned out that it was
> reading a znode with version 0, but when it tried the setData with the version
> set to 0, it failed because the znode had a version of 2 (there were many
> instances of this). After investigating, we saw that the znode had a
> different stat and value on ZK-1 than on ZK-0 & ZK-2.
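> For context, the retry is a standard compare-and-set loop against the znode
> version. A minimal sketch of the pattern (this is not the actual BookKeeper
> UpdateMetadataLoop code; the class name and transform helper are made up for
> illustration):
> {code}
> import java.util.Arrays;
> import java.util.function.UnaryOperator;
>
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> // Hypothetical sketch of a version-checked (compare-and-set) znode update.
> public class CasUpdateSketch {
>     static byte[] updateWithRetry(ZooKeeper zk, String path,
>                                   UnaryOperator<byte[]> transform)
>             throws KeeperException, InterruptedException {
>         while (true) {
>             Stat stat = new Stat();
>             byte[] current = zk.getData(path, false, stat); // read value and version
>             byte[] updated = transform.apply(current);
>             if (Arrays.equals(current, updated)) {
>                 return current; // update is already in place, nothing to write
>             }
>             try {
>                 // succeeds only if the znode still has the version we just read
>                 zk.setData(path, updated, stat.getVersion());
>                 return updated;
>             } catch (KeeperException.BadVersionException bve) {
>                 // a concurrent writer bumped the version; reread and retry
>             }
>         }
>     }
> }
> {code}
> With ZK-1 serving stale data, the read above kept returning version 0 while
> the rest of the ensemble had the znode at version 2, so the setData always
> failed with BadVersionException and the loop spun forever, which is what
> flooded the broker logs.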
> We resolved the situation by deleting the log and snapshots from ZK-1 and
> restarting, at which point everything went back to normal. Had ZK-1 managed
> to become leader we would have been in a lot of trouble, but thankfully this
> didn't happen.
> For the sequence of events that led to split brain, I'll refer to the
> following code.
> {code}
> public class FileTxnSnapLog {
>     ...
>     public boolean truncateLog(long zxid) throws IOException {
>         // close the existing txnLog and snapLog
>         close();
>
>         // truncate it
>         FileTxnLog truncLog = new FileTxnLog(dataDir);
>         boolean truncated = truncLog.truncate(zxid);
>         truncLog.close();
>
>         // re-open the txnLog and snapLog
>         // I'd rather just close/reopen this object itself, however that
>         // would have a big impact outside ZKDatabase as there are other
>         // objects holding a reference to this object.
>         txnLog = new FileTxnLog(dataDir);
>         snapLog = new FileSnap(snapDir);
>
>         return truncated;
>     }
>
>     public void close() throws IOException {
>         txnLog.close();
>         snapLog.close();
>     }
> }
>
> public class FileSnap implements SnapShot {
>     ...
>     public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
>             throws IOException {
>         if (!close) {
>             // actual snapshot code
>         }
>     }
>
>     @Override
>     public synchronized void close() throws IOException {
>         close = true;
>     }
> }
> {code}
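> To make the FileSnap behaviour concrete: once close() has been called,
> serialize() returns without writing anything and without throwing, so the
> caller believes a snapshot has been persisted. A minimal sketch of that
> failure mode (assuming the 3.4/3.5 FileSnap API quoted above and a scratch
> directory; this is an illustration, not a test from the codebase):
> {code}
> import java.io.File;
> import java.nio.file.Files;
> import java.util.concurrent.ConcurrentHashMap;
>
> import org.apache.zookeeper.server.DataTree;
> import org.apache.zookeeper.server.persistence.FileSnap;
>
> public class ClosedFileSnapSketch {
>     public static void main(String[] args) throws Exception {
>         File snapDir = Files.createTempDirectory("snap").toFile();
>         FileSnap snap = new FileSnap(snapDir);
>         snap.close(); // same as what FileTxnSnapLog#close does to snapLog
>
>         // serialize() silently does nothing because close == true
>         snap.serialize(new DataTree(), new ConcurrentHashMap<Long, Integer>(),
>                        new File(snapDir, "snapshot.1"));
>
>         // no exception was thrown, but no snapshot file exists either
>         System.out.println("files in snapDir: " + snapDir.list().length);
>     }
> }
> {code}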
> The sequence of events that led to the failure is as follows:
> | 2020-01-04 01:56:56Z | ZK-2 fails to write to its transaction log due to
> disk full. ZK-2 is still participating in leader election. ZK-2 becomes a
> follower of ZK-1. ZK-1 sends TRUNC to ZK-2. truncLog.truncate on ZK-2 throws
> an exception because of the disk being full, and leaves the process in a
> broken state. |
> |2020-01-04 02:35:23Z | ZK-2 removes 9 transaction logs from disk (bringing
> disk usage from 100% to 19%). It doesn't recover because it's in a broken state. |
> |2020-01-09 08:57:33Z| ZK-1 fails to write to its transaction log due to disk
> full. Restarts as a follower. Goes into a loop of dropping from the quorum
> (because it can't update its transaction log).|
> |2020-01-09 08:59:33Z |ZK-1 receives a snapshot from the leader (ZK-0) (at
> 1e00000000). ZK-1 persists the snapshot, but fails to add subsequent transactions
> to its log due to lack of space. ZK-1 drops from quorum.|
> |2020-01-09 09:00:12Z |ZK-1 joins the quorum as a follower. 1e00000000 is close
> enough to the leader to receive TRUNC(1d0000001d). TRUNC fails because txnLog
> can't flush on close() in truncateLog. ZK-1 goes into a loop, dropping out of
> and rejoining the quorum.|
> |2020-01-09 09:39:00Z |ZK-1 runs purgeTxnLog. The process doesn't recover, as
> the earlier truncation exception has broken FileTxnSnapLog.|
> |2020-01-09 19:28:37Z |ZK-1 is restarted. ZK-1 joins the quorum as a follower. ZK-1
> receives TRUNC(1d0000001d). In this case, txnLog.close() can succeed because
> there's nothing to flush. snapLog is closed. truncLog.truncate fails with
> "java.io.IOException: No log files found to truncate! This could happen if
> you still have snapshots from an old setup or log files were deleted
> accidentally or dataLogDir was changed in zoo.cfg.". It's true that there are
> no log files to truncate, because the snapshot is at 1e00000000, which was
> received from the leader at 08:59, and nothing has been logged since. In any
> case, FileTxnSnapLog is in another inconsistent state. snapLog is closed.
> txnLog is closed, but nothing was ever written to it, so it looks brand
> new.|
> |2020-01-09 19:29:04Z| ZK-2 is restarted. ZK-2 & ZK-0 are now in a good
> state, so they can make progress. Transactions start to be logged.|
> |2020-01-09 19:33:16Z| ZK-1 joins the quorum. As progress has been made, it
> receives a SNAP from the leader at 6b30001183a. It writes a snapshot, which
> ultimately calls FileSnap#serialize. Nothing hits the snapshot disk, because
> FileSnap has been in the closed state since 19:28. ZK-1 starts logging
> transactions to its log disk.|
> |2020-01-09 19:42:00Z |We do a rolling restart of the cluster.|
> |2020-01-09 19:45:11Z |ZK-1 loads the last snapshot that has been persisted
> to disk (1e00000000), and applies all log entries with zxid greater than the
> snapshot (6b30001183a onwards). |
> |2020-01-09 19:47:35Z |ZK-2 & ZK-1 form a quorum, ZK-2 leading. ZK-1 reports
> its lastZxid as 6b30001b32f and gets a DIFF from ZK-2.|
> From this point on, the cluster has split brain. ZK-1 is missing all
> transactions between 1e00000000 and 6b30001183a.
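> For anyone hitting something similar, the divergence is easy to confirm by
> reading the suspect znode from each server individually (one connection per
> server, rather than the full connect string) and comparing the Stat. A rough
> sketch, with placeholder addresses and path:
> {code}
> import java.util.concurrent.CountDownLatch;
>
> import org.apache.zookeeper.Watcher.Event.KeeperState;
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
>
> public class CompareZnodeSketch {
>     public static void main(String[] args) throws Exception {
>         String path = "/ledgers/00/0000/L0001"; // placeholder znode path
>         for (String server : new String[] {"zk-0:2181", "zk-1:2181", "zk-2:2181"}) {
>             CountDownLatch connected = new CountDownLatch(1);
>             ZooKeeper zk = new ZooKeeper(server, 30000, event -> {
>                 if (event.getState() == KeeperState.SyncConnected) {
>                     connected.countDown();
>                 }
>             });
>             try {
>                 connected.await();
>                 Stat stat = new Stat();
>                 byte[] data = zk.getData(path, false, stat);
>                 System.out.printf("%s version=%d mzxid=0x%x dataLength=%d%n",
>                         server, stat.getVersion(), stat.getMzxid(), data.length);
>             } finally {
>                 zk.close();
>             }
>         }
>     }
> }
> {code}
> A znode whose version or mzxid differs on one server from the other two is
> the same symptom we saw on ZK-1.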
> There are a couple of changes to the code that could stop this problem (a rough
> sketch of the latter two follows the list):
> - An exception in truncateLog should nuke the process. Even without the split
> brain occurring, the processes limped on in a broken state for days and
> required human intervention to get going again.
> - snapLog and txnLog should be defensively nulled after they're closed.
> - FileSnap#serialize should not fail silently if close=true. This is really
> bad. It should at least throw an exception.
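> As a rough sketch of what the latter two points could look like (illustrative
> only, not a patch against the actual code):
> {code}
> // FileSnap: refuse to run on a closed instance instead of silently doing nothing.
> public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
>         throws IOException {
>     if (close) {
>         throw new IOException("FileSnap is already closed, refusing to write " + snapShot);
>     }
>     // actual snapshot code
> }
>
> // FileTxnSnapLog: null the handles on close so any later use fails fast
> // (with an NPE) rather than quietly hitting a closed log or snapshot.
> public void close() throws IOException {
>     if (txnLog != null) {
>         txnLog.close();
>         txnLog = null;
>     }
>     if (snapLog != null) {
>         snapLog.close();
>         snapLog = null;
>     }
> }
> {code}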
> The issue occurred with 3.4.13 running on a Kubernetes cluster. The bad code
> paths still exist on master.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)