Andrew Wong created KUDU-3310:
---------------------------------
Summary: Checksum scan results for lagging replicas can be confusing
Key: KUDU-3310
URL: https://issues.apache.org/jira/browse/KUDU-3310
Project: Kudu
Issue Type: Improvement
Components: ops-tooling
Reporter: Andrew Wong
When running a checksum scan, we've seen cases where the following is reported:
{code}
Error: Remote error: Service unavailable: Timed out: could not wait for desired
snapshot timestamp to be consistent: Timed out waiting for ts: P:
1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P:
1621906798962044 usec, L: 0 Physical time difference: 0.025s
{code}
and this results in messages like:
{code}
Aborted: checksum scan error: 1 errors were detected
{code}
To someone without much context about Kudu, this makes it seem like there is
some corruption between replicas, even though the only issue is that the
replica is lagging slightly. We should consider either:
- allowing the wait time to be configured when running the tool, or
- rewording the result so that it's clear the scan failed and no checksums were
verified for the tablet
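
As a rough illustration of the second option, the sketch below parses the error text quoted in this report (with the wrapped timestamp written on one line) and rewords it. It is not Kudu code: `replica_lag_seconds`, `describe_scan_error`, and the reworded message are hypothetical names and wording chosen for this example, and the regular expression assumes the message format shown above.

```python
import re

# Hypothetical helper, not part of the Kudu CLI: extracts the two physical
# timestamps (in microseconds) from a safe-time timeout message.
SAFE_TIME_RE = re.compile(
    r"Timed out waiting for ts: P: (\d+) usec.*?"
    r"Current safe time: P:\s*(\d+) usec",
    re.DOTALL,
)


def replica_lag_seconds(error_text):
    """Return how far the replica's safe time trails the snapshot, or None."""
    match = SAFE_TIME_RE.search(error_text)
    if match is None:
        return None
    wanted_usec, safe_usec = (int(g) for g in match.groups())
    return (wanted_usec - safe_usec) / 1_000_000


def describe_scan_error(error_text):
    """Reword a scan error so a timeout is not mistaken for a mismatch."""
    lag = replica_lag_seconds(error_text)
    if lag is None:
        # Not a safe-time timeout: pass the original error through unchanged.
        return "checksum scan error: " + error_text
    return (
        "checksum scan did not complete: replica safe time is "
        f"{lag:.3f}s behind the snapshot timestamp; "
        "no checksums were verified for this tablet"
    )


SAMPLE = (
    "Error: Remote error: Service unavailable: Timed out: could not wait for "
    "desired snapshot timestamp to be consistent: Timed out waiting for ts: "
    "P: 1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current "
    "safe time: P: 1621906798962044 usec, L: 0 Physical time difference: 0.025s"
)

print(describe_scan_error(SAMPLE))
```

The two timestamps differ by 24720 usec, which matches the 0.025s "Physical time difference" in the original error, so the reworded message can state the lag directly instead of reporting a generic error count.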