[
https://issues.apache.org/jira/browse/ZOOKEEPER-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Kelly updated ZOOKEEPER-3701:
----------------------------------
Description:
We ran into a situation where the cluster ended up with split brain when the
log disk filled up on a node.
The ZK cluster (3 nodes) in question was being used as the metadata store for
Pulsar. There was an outage in the Pulsar cluster where two of the ZK nodes
filled up their log disks, causing the cluster to lose quorum. Once we
rectified the full-disk situation and restarted the nodes everything seemed to
work, but we started getting a lot of log messages about UpdateMetadataLoop
retrying. UpdateMetadataLoop is used to update BookKeeper ledger metadata. If
it sees a write conflict it rereads the znode, checks whether the update still
needs to happen, applies it and writes. These retries were flooding the log on
a subset of the brokers. It turned out that it was reading a znode with version
0, but when it tried the setData with version set to 0 it was failing because
the znode had a version of 2 (there were many instances of this). After
investigating, we saw that the znode had a different stat and value on ZK-1
than on ZK-0 & ZK-2.
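For context, that update loop is essentially a compare-and-set against the znode
version. This is not Pulsar's actual UpdateMetadataLoop code, just a minimal
sketch of the pattern using the plain ZooKeeper client API (the class and method
names here are made up for illustration):
{code}
import java.util.Arrays;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class MetadataUpdateSketch {
    // Re-read, check and conditionally write until the versioned write succeeds.
    static void updateWithRetry(ZooKeeper zk, String path, byte[] newValue)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            byte[] current = zk.getData(path, false, stat);    // read znode + stat
            if (Arrays.equals(current, newValue)) {
                return;                                        // update no longer needed
            }
            try {
                zk.setData(path, newValue, stat.getVersion()); // conditional write
                return;
            } catch (KeeperException.BadVersionException e) {
                // Someone else wrote in between: loop and re-read.
                // In our incident, ZK-1 kept answering the read with version 0
                // while the write was rejected against version 2, so the loop
                // never terminated and flooded the broker logs.
            }
        }
    }
}
{code}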
We resolved the situation by deleting the log and snapshots from ZK-1 and
restarting, at which point everything went back to normal. Had ZK-1 managed to
become leader we would have been in a lot of trouble, but thankfully this
didn't happen.
For the sequence of events that led to split brain, I'll refer to the following
code.
{code}
public class FileTxnSnapLog {
    ...
    public boolean truncateLog(long zxid) throws IOException {
        // close the existing txnLog and snapLog
        close();
        // truncate it
        FileTxnLog truncLog = new FileTxnLog(dataDir);
        boolean truncated = truncLog.truncate(zxid);
        truncLog.close();
        // re-open the txnLog and snapLog
        // I'd rather just close/reopen this object itself, however that
        // would have a big impact outside ZKDatabase as there are other
        // objects holding a reference to this object.
        txnLog = new FileTxnLog(dataDir);
        snapLog = new FileSnap(snapDir);
        return truncated;
    }

    public void close() throws IOException {
        txnLog.close();
        snapLog.close();
    }
}

public class FileSnap implements SnapShot {
    ...
    public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
            throws IOException {
        if (!close) {
            // actual snapshot code
        }
    }

    @Override
    public synchronized void close() throws IOException {
        close = true;
    }
}
{code}
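The hazard here is that if close() or truncLog.truncate() throws (for example
because the disk is full), truncateLog returns before txnLog and snapLog are
re-opened, leaving the shared FileTxnSnapLog holding closed instances. Purely as
an illustration of that hazard (a sketch, not a proposed or actual patch), a more
defensive shape might look like:
{code}
// Sketch only, not actual ZooKeeper code: one way to avoid leaving this
// object holding closed txnLog/snapLog references when truncation fails.
public boolean truncateLog(long zxid) throws IOException {
    try {
        // close the existing txnLog and snapLog
        close();
        // truncate it
        FileTxnLog truncLog = new FileTxnLog(dataDir);
        try {
            return truncLog.truncate(zxid);
        } finally {
            truncLog.close();
        }
    } finally {
        // Re-open unconditionally so later snapshots and log appends are not
        // silently dropped. (The stronger option, argued for at the end of this
        // description, is to treat any failure here as fatal and stop the process.)
        txnLog = new FileTxnLog(dataDir);
        snapLog = new FileSnap(snapDir);
    }
}
{code}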
The sequence of events that led to the failure is:
||Time||Event||
|2020-01-04 01:56:56Z|ZK-2 fails to write to its transaction log due to disk full. ZK-2 is still participating in leader election. ZK-2 becomes a follower of ZK-1. ZK-1 sends TRUNC to ZK-2. truncLog.truncate on ZK-2 throws an exception because the disk is full, and leaves the process in a broken state.|
|2020-01-04 02:35:23Z|ZK-2 removes 9 transaction logs from disk (bringing it from 100% to 19%). It doesn't recover because it's in a broken state.|
|2020-01-09 08:57:33Z|ZK-1 fails to write to its transaction log due to disk full. Restarts as follower. Goes into a loop of dropping from quorum (because it can't update its transaction log).|
|2020-01-09 08:59:33Z|ZK-1 receives a snapshot from the leader (ZK-0) (at 1e00000000). ZK-1 persists the snapshot, but fails to add subsequent transactions to the log due to lack of space. ZK-1 drops from quorum.|
|2020-01-09 09:00:12Z|ZK-1 joins the quorum as a follower. 1e00000000 is close enough to the leader to receive TRUNC(1d0000001d). TRUNC fails because txnLog can't flush on close() in truncateLog. ZK-1 goes into a loop, dropping from and rejoining the quorum.|
|2020-01-09 09:39:00Z|ZK-1 runs purgeTxnLog. The process doesn't recover because the truncation exception has broken FileTxnSnapLog.|
|2020-01-09 19:28:37Z|ZK-1 is restarted. ZK-1 joins the quorum as a follower. ZK-1 receives TRUNC(1d0000001d). In this case, txnLog.close() can succeed because there's nothing to flush. snapLog is closed. truncLog.truncate fails with "java.io.IOException: No log files found to truncate! This could happen if you still have snapshots from an old setup or log files were deleted accidentally or dataLogDir was changed in zoo.cfg.". It's true that there are no log files to truncate, because the snapshot is at 1e00000000, which was received from the leader at 08:59, and nothing has been logged since. In any case, FileTxnSnapLog is in another inconsistent state. snapLog is closed. txnLog is closed, but nothing was ever written to it, so it looks brand new.|
|2020-01-09 19:29:04Z|ZK-2 is restarted. ZK-2 & ZK-0 are now in a good state, so they can make progress. Transactions start to be logged.|
|2020-01-09 19:33:16Z|ZK-1 joins the quorum. As progress has been made, it receives a SNAP from the leader at 6b30001183a. It writes a snapshot, which ultimately calls FileSnap#serialize. Nothing hits the snapshot disk, because FileSnap has been in the closed state since 19:28. ZK-1 starts logging transactions to its log disk.|
|2020-01-09 19:42:00Z|We do a rolling restart of the cluster.|
|2020-01-09 19:45:11Z|ZK-1 loads the last snapshot that was persisted to disk (1e00000000), and applies all log entries with zxid greater than the snapshot (6b30001183a onwards).|
|2020-01-09 19:47:35Z|ZK-2 & ZK-1 form a quorum, ZK-2 leading. ZK-1 reports its lastZxid as 6b30001b32f and gets a DIFF from ZK-2.|
From this point on, the cluster has split brain. ZK-1 is missing all
transactions between 1e00000000 and 6b30001183a.
There are a couple of changes to the code that could prevent this kind of problem:
- An exception in truncateLog should nuke the process. Even without the split
brain occurring, the processes limped on in a broken state for days and
required human intervention to get going again.
- snapLog and txnLog should be defensively nulled after they're closed.
- FileSnap#serialize should not fail silently if close=true. This is really
bad. It should at least throw an exception (see the sketch below).
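To make that last point concrete, here is a minimal sketch of the kind of guard
meant, written against the FileSnap fields shown above (illustrative only, not
an actual patch):
{code}
public synchronized void serialize(DataTree dt, Map<Long, Integer> sessions, File snapShot)
        throws IOException {
    if (close) {
        // Refuse to silently drop the snapshot. The caller (and arguably the
        // whole process) needs to know that persistence has stopped working.
        throw new IOException("FileSnap is already closed, refusing to serialize to " + snapShot);
    }
    // actual snapshot code
}
{code}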
The issue occurred with 3.4.13 running on a Kubernetes cluster. The bad code
paths still exist on master.
> Split brain on log disk full
> ----------------------------
>
> Key: ZOOKEEPER-3701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.13, 3.5.6
> Reporter: Ivan Kelly
> Assignee: Andor Molnar
> Priority: Blocker
>