Alexey Serbin created KUDU-3163:
-----------------------------------
Summary: Long after restarting kudu-tserver nodes, follower
replicas continue rejecting scan requests with 'Uninitialized: safe time has
not yet been initialized' error
Key: KUDU-3163
URL: https://issues.apache.org/jira/browse/KUDU-3163
Project: Kudu
Issue Type: Bug
Components: tserver
Reporter: Alexey Serbin
Attachments: logs.tar.bz2
There was a report of tablet replicas left in a strange state after a rolling
restart of some kind. ksck with checksum reported the tablet was fine, but
follower replicas continued rejecting scan requests with the {{Uninitialized:
safe time has not yet been initialized}} error. The issue appeared to go away
after forcing a tablet leader re-election. No new write operations (INSERT,
UPDATE, DELETE) were issued against the tablet.
As already mentioned, some nodes in the cluster were restarted. Before the
restart, the {{\-\-follower_unavailable_considered_failed_sec}} flag was set to
{{3600}}.
At this time, I don't have a clear picture of what was going on, but I wanted
to record the available information. I still need to do a root cause analysis
to produce a clear description and diagnosis of the issue.
The logs are attached. These are tablet server logs filtered to contain only
the lines attributed to the affected tablet, UUID
{{c56432b0164e45d98175f26a54d65270}}. At the time the logs were captured,
{{hdp025}} hosted the leader replica of the tablet, while {{hdp014}} and
{{hdp035}} hosted the follower replicas.