[
https://issues.apache.org/jira/browse/KUDU-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Wong reassigned KUDU-3191:
---------------------------------
Assignee: Andrew Wong
> Fail tablet replicas that suffer from KUDU-2233 instead of crashing
> -------------------------------------------------------------------
>
> Key: KUDU-3191
> URL: https://issues.apache.org/jira/browse/KUDU-3191
> Project: Kudu
> Issue Type: Task
> Components: compaction
> Reporter: Andrew Wong
> Assignee: Andrew Wong
> Priority: Major
>
> KUDU-2233 results in persisted corruption that breaks an invariant and
> crashes the server. Recovering from this corruption is arduous, especially
> when multiple tablet replicas on a given server suffer from it: users
> typically start the server, see the crash, remove the affected replica
> manually via tooling, and restart, repeating until the server comes up
> healthy.
> Instead, we should consider treating this as we do CFile block-level
> corruption[1] and fail the tablet replica. At best, we end up recovering from
> a non-corrupted replica. At worst, we end up with multiple corrupted
> replicas, which is still better than today's outcome: multiple corrupted
> replicas plus unavailable servers that lead to excessive re-replication.
> [1]
> https://github.com/apache/kudu/commit/cf6927cb153f384afb649b664de1d4276bd6d83f
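
The proposed behavior can be illustrated with a small toy model. This is only a sketch of the idea, not Kudu code: `Replica`, `ReplicaState`, `CorruptionError`, `check_invariants`, and `start_server` are all hypothetical names, and the `corrupted` flag stands in for whatever persisted state actually trips the invariant. It shows the server marking a corrupted replica as failed and continuing startup, rather than letting the broken invariant take the whole process down.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicaState(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"


class CorruptionError(Exception):
    """Raised when a replica's persisted data violates an invariant."""


@dataclass
class Replica:
    tablet_id: str
    corrupted: bool = False  # hypothetical stand-in for persisted corruption
    state: ReplicaState = ReplicaState.RUNNING

    def check_invariants(self) -> None:
        if self.corrupted:
            raise CorruptionError(f"tablet {self.tablet_id}: broken invariant")


def start_server(replicas):
    """Start every replica; fail corrupted ones instead of aborting the server.

    Returns the tablet IDs of the replicas that were marked FAILED.
    """
    for replica in replicas:
        try:
            replica.check_invariants()
        except CorruptionError:
            # Contain the damage to this one replica so the server stays up
            # and a healthy copy elsewhere can be used for recovery.
            replica.state = ReplicaState.FAILED
    return [r.tablet_id for r in replicas if r.state is ReplicaState.FAILED]
```

In this model a single startup pass handles any number of corrupted replicas, in contrast to the start, crash, remove, restart loop described in the issue.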