Hello Will Berkeley, Kudu Jenkins, Andrew Wong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12647
to look at the new patch set (#2).
Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................
[TS heartbeater] avoid reconnecting to master too often
With this patch, the heartbeater thread in a tserver no longer resets
its master proxy and reconnects to the master (re-negotiating the
connection) on every heartbeat under certain conditions. In particular,
that happened if the master was accepting connections and responding to
Ping RPC requests, but was not able to process TS heartbeats properly
because it was still bootstrapping.
E.g., when running the RemoteKsckTest.TestClusterWithLocation test
scenario for TSAN builds, I sometimes saw log messages like the following
(the test sets FLAGS_heartbeat_interval_ms = 10):
I0301 20:29:11.932394 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.944639 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.946904 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.960994 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.964995 3819 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.972220 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.974987 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.988946 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:11.991653 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.003091 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.017015 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.017540 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.031175 3819 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.031175 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.046165 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.059644 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.073026 3819 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.075335 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.077802 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.089138 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.101193 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.102268 3819 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.104634 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.118392 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.132237 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.147235 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.165709 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.171120 3819 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.179481 3746 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
I0301 20:29:12.191591 3671 heartbeater.cc:345] Connected to a master server at
127.3.75.254:36221
It turned out the counter of consecutive failed heartbeats kept
increasing because the master was responding with ServiceUnavailable
to incoming TS heartbeats. The prior version of the code reset the
master proxy on every failed heartbeat once
FLAGS_heartbeat_max_failures_before_backoff consecutive errors had
occurred, and that was the reason behind the frequent re-connections
to the master.
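To illustrate why the prior behavior produced a reconnect on nearly every
heartbeat, here is a minimal standalone sketch. It is not the code from
heartbeater.cc: the constant kMaxFailuresBeforeBackoff is a hypothetical
stand-in for FLAGS_heartbeat_max_failures_before_backoff (the value 3 is
arbitrary), and ShouldResetProxyThrottled is just one possible way to limit
resets, assumed for illustration and not necessarily what this patch does.

```cpp
#include <cassert>

// Hypothetical stand-in for FLAGS_heartbeat_max_failures_before_backoff;
// the value 3 is chosen for illustration only.
constexpr int kMaxFailuresBeforeBackoff = 3;

// Behavior described for the prior code: once the threshold is reached,
// every further failed heartbeat resets the master proxy.
bool ShouldResetProxyOld(int consecutive_failures) {
  return consecutive_failures >= kMaxFailuresBeforeBackoff;
}

// One possible alternative (an assumption, not necessarily the patch's
// logic): reset only each time another kMaxFailuresBeforeBackoff
// failures have accumulated.
bool ShouldResetProxyThrottled(int consecutive_failures) {
  return consecutive_failures > 0 &&
         consecutive_failures % kMaxFailuresBeforeBackoff == 0;
}

// Counts how many proxy resets a policy triggers over a run of
// 'failed_heartbeats' consecutive failed heartbeats.
template <typename Policy>
int CountResets(int failed_heartbeats, Policy should_reset) {
  assert(failed_heartbeats >= 0);
  int resets = 0;
  for (int failures = 1; failures <= failed_heartbeats; ++failures) {
    if (should_reset(failures)) {
      ++resets;
    }
  }
  return resets;
}
```

With these assumptions, over 30 consecutive failed heartbeats
CountResets(30, ShouldResetProxyOld) gives 28 resets, while
CountResets(30, ShouldResetProxyThrottled) gives 10.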
For testing, I just verified that the TS heartbeater no longer behaves
as described above under the same scenario and conditions.
Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 16 insertions(+), 3 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/12647/2
--
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <[email protected]>