[ https://issues.apache.org/jira/browse/KUDU-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15922815#comment-15922815 ]

Adar Dembo commented on KUDU-1934:
----------------------------------

bq. I don't see "no longer allowing fast heartbeat attempts" in the log.

That warning is logged only once: the first time the number of consecutive
failures equals heartbeat_max_failures_before_backoff. So if the failures have
been occurring for days, the warning may well have been rotated out of the
logs. Here's the code:

{noformat}
  // If we've failed a few heartbeats in a row, back off to the normal
  // interval, rather than retrying in a loop.
  if (consecutive_failed_heartbeats_ == FLAGS_heartbeat_max_failures_before_backoff) {
    LOG(WARNING) << "Failed " << consecutive_failed_heartbeats_ << " heartbeats "
                 << "in a row: no longer allowing fast heartbeat attempts.";
  }
{noformat}

Besides, the log output you pasted into the bug description shows two heartbeat 
attempts a little over a second apart. Isn't that exactly the "backoff" 
behavior I described?
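
To make the "logged only once" point concrete, here is a minimal standalone
sketch of the pattern. It is not the actual heartbeater code: the struct, the
constants, and their values (3 failures, 0 ms fast retry, 1000 ms normal
interval) are hypothetical stand-ins, and a vector of strings stands in for
LOG(WARNING). It only illustrates why an equality check on the failure counter
fires the warning a single time while the backoff itself persists.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the real flags; the values are assumptions.
constexpr int kMaxFailuresBeforeBackoff = 3;
constexpr int kFastRetryMs = 0;
constexpr int kNormalIntervalMs = 1000;

// Illustrative only; not the real heartbeater class.
struct HeartbeatScheduler {
  int consecutive_failed_heartbeats = 0;
  std::vector<std::string> warnings;  // stands in for LOG(WARNING)

  // Record the outcome of one heartbeat attempt.
  void RecordResult(bool ok) {
    if (ok) {
      consecutive_failed_heartbeats = 0;
    } else {
      ++consecutive_failed_heartbeats;
    }
  }

  // Returns the delay before the next attempt. Expected to be called once
  // per attempt, after RecordResult().
  int MillisUntilNextHeartbeat() {
    // Equality rather than >=: the warning is emitted only at the moment
    // the counter reaches the threshold, not on every later failure.
    if (consecutive_failed_heartbeats == kMaxFailuresBeforeBackoff) {
      warnings.push_back(
          "Failed " + std::to_string(consecutive_failed_heartbeats) +
          " heartbeats in a row: no longer allowing fast heartbeat attempts.");
    }
    // Below the threshold, retry quickly; at or above it, fall back to the
    // normal interval.
    return consecutive_failed_heartbeats < kMaxFailuresBeforeBackoff
               ? kFastRetryMs
               : kNormalIntervalMs;
  }
};
```

In this sketch, ten consecutive failures produce exactly one warning, and
every attempt from the third failure onward waits the normal interval. That
would be consistent with seeing attempts roughly a second apart and no recent
warning in the log.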


> tservers aggressively try to reconnect to masters
> -------------------------------------------------
>
>                 Key: KUDU-1934
>                 URL: https://issues.apache.org/jira/browse/KUDU-1934
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.3.0
>            Reporter: Jean-Daniel Cryans
>              Labels: newbie
>
> Related to KUDU-1933, I had mismatched 1.3 snapshots between the master and 
> the tservers which caused them to try to reconnect to the master infinitely. 
> Since they do it as fast as they can, the logs were quickly full of:
> {noformat}
> I0307 23:55:21.228502 70832 heartbeater.cc:291] Connected to a master server 
> at ve0120.halxg.cloudera.com:7051
> I0307 23:55:21.228528 70832 heartbeater.cc:359] Registering TS with master...
> I0307 23:55:21.228865 70832 heartbeater.cc:389] Master 
> ve0120.halxg.cloudera.com:7051 requested a full tablet report, sending...
> W0307 23:55:21.346961 70832 heartbeater.cc:499] Failed to heartbeat to 
> ve0120.halxg.cloudera.com:7051: Remote error: Failed to send heartbeat to 
> master: Not authorized: invalid CSR: CSR did not contain expected username. 
> (CSR: '' RPC: 'kudu')
> I0307 23:55:22.347733 70832 heartbeater.cc:291] Connected to a master server 
> at ve0120.halxg.cloudera.com:7051
> I0307 23:55:22.347757 70832 heartbeater.cc:359] Registering TS with master...
> I0307 23:55:22.348042 70832 heartbeater.cc:389] Master 
> ve0120.halxg.cloudera.com:7051 requested a full tablet report, sending...
> W0307 23:55:22.467021 70832 heartbeater.cc:499] Failed to heartbeat to 
> ve0120.halxg.cloudera.com:7051: Remote error: Failed to send heartbeat to 
> master: Not authorized: invalid CSR: CSR did not contain expected username. 
> (CSR: '' RPC: 'kudu')
> {noformat}
> Sounds like we should do backoff retries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
