Andrew Wong created KUDU-3074:
---------------------------------
Summary: Consider putting a replica in some read-only, quarantined
state if certain invariants are broken
Key: KUDU-3074
URL: https://issues.apache.org/jira/browse/KUDU-3074
Project: Kudu
Issue Type: Improvement
Components: supportability
Reporter: Andrew Wong
In the past, we've seen some bugs that were caused by broken invariants. These
invariants are usually codified as CHECKs, which bring down the process when
they fail. While this is an effective way of getting the attention of an
operator, who can then surface the issue to a developer, bringing down the
process can significantly impact the rest of the service. It'd be great if we
could instead put the affected tablet replica into a read-only state (similar
to FAILED, but one that wouldn't lead to immediate re-replication, so the issue
is still surfaced to admins).
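To make the idea above concrete, here is a minimal sketch of what replacing a
process-wide CHECK with a per-replica quarantine could look like. Every name in
it (Replica, ReplicaState, CheckOrQuarantine, and so on) is hypothetical and
made up for illustration; none of them are existing Kudu classes or APIs, and a
real change would have to fit into the tablet replica lifecycle.

```cpp
#include <string>
#include <utility>

// Hypothetical names throughout; none of these are existing Kudu APIs.
enum class ReplicaState { kRunning, kQuarantined };

class Replica {
 public:
  // Checks an invariant. On failure, records the reason and quarantines only
  // this replica, instead of aborting the whole process as a CHECK would.
  // Returns whether the invariant held.
  bool CheckOrQuarantine(bool invariant_holds, std::string reason) {
    if (invariant_holds) return true;
    if (state_ != ReplicaState::kQuarantined) {
      state_ = ReplicaState::kQuarantined;
      quarantine_reason_ = std::move(reason);  // keep the first failure reason
    }
    return false;
  }

  // Writes are rejected once the replica is quarantined.
  bool Write(int value) {
    if (state_ != ReplicaState::kRunning) return false;
    last_value_ = value;
    return true;
  }

  // Reads stay available so the existing data can still be inspected.
  int Read() const { return last_value_; }

  ReplicaState state() const { return state_; }
  const std::string& quarantine_reason() const { return quarantine_reason_; }

 private:
  ReplicaState state_ = ReplicaState::kRunning;
  std::string quarantine_reason_;
  int last_value_ = 0;
};
```

In this sketch a failed invariant only affects the replica it was detected on:
other replicas hosted by the same process keep accepting writes, and the stored
reason gives an operator something to report to a developer.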
It's tough to gauge which broken invariants actually warrant this (maybe none,
given they're called "invariants"), but for instance, KUDU-2233 is a known bug
that leads to persistent corruption. The resulting CHECK failure can cause the
process to fail. While it's useful to know about the corruption, it would be
better if operators could ignore the issue when appropriate (e.g. when the data
has been successfully replicated elsewhere) and, at worst, be left with
multiple failed replicas.