RaulGracia opened a new issue #2482:
URL: https://github.com/apache/bookkeeper/issues/2482


   **BUG REPORT**
   
   ***Describe the bug***
   We have found a problem that we believe is mostly related to the BookKeeper client. It is as follows: we deploy a 3-Bookie ensemble in Kubernetes along with Pravega, and then deliberately reboot one of the Kubernetes nodes, which forces the restart of all pods running on it, including one Bookie. In such deployments, we have multiple Segment Stores connected to BookKeeper. Initially, almost all of them correctly detected right away that a Bookie was down (`2020-10-23 05:24:15` server time):
   ```
   2020-10-23 05:24:15,706 34559029 [bookkeeper-io-3-2] INFO  
o.a.b.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 
0xa020271c, L:/xxx.176:48038 ! 
R:bookie-0.bookie-headless.pravega.svc.cluster.local/xxx.146:3181]
   ```
   ```
   2020-10-23 05:24:15,707 34567544 [bookkeeper-io-3-2] INFO  
o.a.b.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 
0x578a8730, L:/xxx.92:51952 ! 
R:bookie-0.bookie-headless.pravega.svc.cluster.local/xxx.146:3181]
   ```
   ```
   2020-10-23 05:24:15,707 34546304 [bookkeeper-io-3-1] INFO  
o.a.b.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 
0x7da89b7f, L:/xxx.207:59078 ! 
R:bookie-0.bookie-headless.pravega.svc.cluster.local/xxx.146:3181]
   ```
   One of the Segment Stores closed the channel when the restarted Bookie was 
up again (`2020-10-23 05:41:22` server time):
   ```
   2020-10-23 05:41:22,232 35606108 [bookkeeper-io-3-1] INFO  
o.a.b.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 
0x3f5b5d81, L:/xxx.9:59906 ! 
R:bookie-0.bookie-headless.pravega.svc.cluster.local/xxx.146:3181]
   ```
   So far so good: the system introduces no delay beyond the time taken by the node itself to reboot. However, the problem lies with one of the Segment Stores, whose _BookKeeper client took 15 additional minutes to close the Bookie channel_:
   ```
   2020-10-23 05:57:19,762 36520312 [bookkeeper-io-3-2] INFO  
o.a.b.proto.PerChannelBookieClient - Disconnected from bookie channel [id: 
0x9796f294, L:/xxx.68:40938 ! 
R:bookie-0.bookie-headless.pravega.svc.cluster.local/xxx.146:3181]
   ```
   The point is that, apparently, _the BookKeeper client does not re-resolve the Bookie's name to its new IP until the Bookie channel is closed_. Therefore, for 15 minutes, the BookKeeper client in _that Segment Store kept attempting to read from the old IP of Bookie 0_. During that whole period, the Segment Store was getting `BKReadException`s while trying to read from Bookie 0, but it looks like this is not enough to trigger re-resolution of the Bookie name:
   ```
   2020-10-23 05:57:08,475 36509025 [core-14] WARN  
i.p.s.s.c.StreamSegmentContainer - SegmentContainer[37]: DurableLog failed. 
Shutting down StreamSegmentContainer.
   io.pravega.segmentstore.storage.DurableDataLogException: Unable to open-read 
ledger 11009.
        at 
io.pravega.segmentstore.storage.impl.bookkeeper.Ledgers.openRead(Ledgers.java:109)
   ...
   Caused by: org.apache.bookkeeper.client.BKException$BKReadException: Error 
while reading ledger
   ``` 
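   For context, this stale-IP behavior is consistent with how eager name resolution works in standard Java networking (a general illustration, not the exact BookKeeper code path): an `InetSocketAddress` built from a hostname resolves once at construction time and pins that IP for its lifetime, so a channel built from it keeps the old address until a fresh connect:
   ```java
   public class ResolveOnce {
       public static void main(String[] args) {
           // Resolution happens here, once; if the peer's IP changes later,
           // this object still holds the old address until a new one is built.
           java.net.InetSocketAddress eager =
               new java.net.InetSocketAddress("localhost", 3181);
           System.out.println(eager.isUnresolved()); // false: IP captured eagerly

           // createUnresolved defers resolution to connect time instead:
           java.net.InetSocketAddress lazy =
               java.net.InetSocketAddress.createUnresolved("localhost", 3181);
           System.out.println(lazy.isUnresolved()); // true
       }
   }
   ```
   If the client only rebuilds the address on channel close, that would explain why the old IP survives for as long as the channel stays open.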
   Apparently, the correct behavior would be to trigger `PerChannelBookieClient:channelInactive` as soon as a Bookie crashes, but according to our experiments this does not always happen. As a hypothesis, the fact that we consistently see a 15-minute delay before the Bookie channel is closed is striking; it may indicate that other underlying timeouts are in play that prevent the connection from being closed and `channelInactive` from being invoked, but I have no proof of that. 
   
   
   
   ***To Reproduce***
   
   Steps to reproduce the behavior:
   1. Deploy Pravega (https://pravega.io/) in a Kubernetes cluster (e.g. 
OpenShift, Kubespray). Our reproductions of the issue use 6 Segment Stores and 
3 Bookies (writeQuorum=ensembleQuorum=3).
   2. Once the system is up and running, shut down any of the Kubernetes nodes 
running a Bookie.
   3. In this situation, we have been able to reproduce the behavior in which 1 Segment Store (out of 6) keeps trying to contact the restarted Bookie at its old IP. This persists for over 15 minutes after the Bookie was restarted, until the BookKeeper client on that particular Segment Store closes the channel. All the other Segment Stores (and the other Bookies in the system) detect the failed Bookie as soon as it goes down and reconnect to it when it comes up again.
   
   The configuration of our Bookkeeper client can be found here: 
https://github.com/pravega/pravega/blob/7e8a26133281a86ca8823c4665fd7d2fb836eeb1/segmentstore/storage/impl/src/main/java/io/pravega/segmentstore/storage/impl/bookkeeper/BookKeeperLogFactory.java#L138
   
   ***Expected behavior***
   
   All BookKeeper clients should detect that a Bookie in the ensemble is unresponsive and reset the channel, so that the Bookie's IP can be re-resolved in case it has changed.
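   One possible mitigation shape (purely hypothetical; `FailureWatchdog` and the `resetChannel` callback are names made up for illustration, not BookKeeper API) would be to count consecutive read failures on a channel and proactively close it, rather than waiting for TCP to give up, since closing the channel is what triggers re-resolution:
   ```java
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical sketch: force a channel reset (and hence a fresh DNS
   // resolution on reconnect) after N consecutive read failures.
   public class FailureWatchdog {
       private final int threshold;
       private final Runnable resetChannel; // e.g. a channel-close callback
       private final AtomicInteger consecutiveFailures = new AtomicInteger();

       public FailureWatchdog(int threshold, Runnable resetChannel) {
           this.threshold = threshold;
           this.resetChannel = resetChannel;
       }

       public void onReadSuccess() {
           consecutiveFailures.set(0);
       }

       public void onReadFailure() {
           if (consecutiveFailures.incrementAndGet() >= threshold) {
               consecutiveFailures.set(0);
               // closing the channel would fire channelInactive and allow
               // the client to re-resolve the bookie address on reconnect
               resetChannel.run();
           }
       }
   }
   ```
   This is only meant to illustrate the expected behavior; where exactly such a check would belong inside the client is for the BookKeeper maintainers to judge.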
   
   ***Screenshots***
   
   N/A
   
   ***Additional context***
   
   Following https://github.com/apache/bookkeeper/pull/1762, we have experimented with setting `networkaddress.cache.ttl`, without success.
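   For completeness, a minimal sketch of setting the JVM DNS cache TTL programmatically (via the standard `java.security.Security` mechanism; it can also be set in the `java.security` file or with the `sun.net.inetaddr.ttl` system property at startup):
   ```java
   import java.security.Security;

   public class DnsCacheTtl {
       public static void main(String[] args) {
           // Disable positive and negative DNS caching so each InetAddress
           // lookup re-resolves; must run before the first lookup to take
           // effect reliably.
           Security.setProperty("networkaddress.cache.ttl", "0");
           Security.setProperty("networkaddress.cache.negative.ttl", "0");
           System.out.println(Security.getProperty("networkaddress.cache.ttl"));
       }
   }
   ```
   Note that this setting would not help if the stale address is pinned in an already-resolved socket address held by the open channel, which would be consistent with the behavior we observed.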
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

