Andrew Wong created KUDU-3074:
---------------------------------

             Summary: Consider putting a replica in some read-only, quarantined 
state if certain invariants are broken
                 Key: KUDU-3074
                 URL: https://issues.apache.org/jira/browse/KUDU-3074
             Project: Kudu
          Issue Type: Improvement
          Components: supportability
            Reporter: Andrew Wong


In the past, we've seen some bugs that were caused by broken invariants. These 
broken invariants are usually codified as CHECKs, which bring down the process.

While this is an effective way of getting the attention of an operator, who can 
then surface the issue to a developer, bringing down the process can 
significantly impact the rest of the service. It'd be great if we could instead 
put the tablet replica into a read-only state (similar to FAILED, but one that 
wouldn't lead to immediate re-replication, so the issue is still surfaced to 
admins).
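
To make the idea concrete, here is a rough sketch of what quarantining instead 
of crashing might look like. Everything in it is hypothetical: the QUARANTINED 
state, the Replica class, and the CheckOrQuarantine() helper are made up for 
illustration and are not existing Kudu APIs. It only shows the intended 
behavior: a broken invariant flips the replica to a state that rejects writes 
but still serves reads, and the reason is kept around for an operator.

```cpp
#include <cassert>
#include <string>

// Hypothetical replica states. QUARANTINED is the state proposed here;
// it is an assumption for this sketch, not an existing Kudu state.
enum class ReplicaState { RUNNING, QUARANTINED };

// Hypothetical stand-in for a tablet replica.
class Replica {
 public:
  // Records the broken invariant and stops accepting writes.
  void Quarantine(const std::string& reason) {
    state_ = ReplicaState::QUARANTINED;
    reason_ = reason;
  }

  // Writes are rejected unless the replica is RUNNING.
  bool Write(const std::string& row) {
    if (state_ != ReplicaState::RUNNING) return false;
    rows_ += row;
    return true;
  }

  // Reads stay available even when quarantined.
  const std::string& Read() const { return rows_; }

  ReplicaState state() const { return state_; }
  const std::string& quarantine_reason() const { return reason_; }

 private:
  ReplicaState state_ = ReplicaState::RUNNING;
  std::string reason_;
  std::string rows_;
};

// Hypothetical alternative to CHECK(cond): rather than aborting the process,
// quarantine the affected replica and report the failure to the caller.
inline bool CheckOrQuarantine(bool invariant_holds, Replica* replica,
                              const std::string& what) {
  // A null replica is a programming error in the caller, not a data invariant.
  assert(replica != nullptr);
  if (invariant_holds) return true;
  replica->Quarantine(what);
  return false;
}
```

A caller would have to handle the false return by bailing out of the current 
operation, which is the main cost compared to a CHECK: every call site needs a 
sane error path.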

It's tough to gauge which broken invariants actually warrant this (maybe none, 
given they're called "invariants"), but for instance, KUDU-2233 is a known bug 
that leads to persistent corruption. The resulting CHECK failure can cause the 
process to fail; while the corruption is worth knowing about, it would be better 
if operators could ignore the issue when appropriate (e.g. when the data is 
successfully replicated elsewhere) and, at worst, be left with multiple failed 
replicas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
