[ https://issues.apache.org/jira/browse/KUDU-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014620#comment-17014620 ]
Andrew Wong commented on KUDU-2404:
-----------------------------------
Thinking about this in the context of the WAL: since out-of-space issues may
very well be transient, we probably don't need to fail the tablet (which is
currently a terminal state). I haven't fully thought this through, but a
transiently-full disk seems similar in behavior to a slow WAL disk. That
said, if we fail to write because we've run out of space, we need to be sure
that the in-memory state and the persistent cmeta state associated with the
failed operation are rolled back.
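
To make the rollback idea concrete, here is a rough sketch in Python. It is
not Kudu code: FakeWal, Replica, and their fields are hypothetical stand-ins,
and the "cmeta" here is just a dict. It only illustrates treating ENOSPC as
transient (no terminal failed state) while restoring both pieces of state.

```python
import errno


class FakeWal:
    """Hypothetical stand-in for a WAL on a disk with limited capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def append(self, entry):
        # Simulate a full disk once the capacity is reached.
        if len(self.entries) >= self.capacity:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.entries.append(entry)


class Replica:
    """Toy replica with in-memory state and a dict standing in for cmeta."""

    def __init__(self, wal):
        self.wal = wal
        self.applied_ops = []     # in-memory state (assumed shape)
        self.cmeta = {"term": 0}  # stand-in for persistent cmeta state
        self.failed = False       # terminal state; not entered on ENOSPC

    def apply(self, op, new_term):
        # Snapshot both pieces of state so they can be restored on failure.
        saved_ops = list(self.applied_ops)
        saved_cmeta = dict(self.cmeta)
        try:
            self.applied_ops.append(op)
            self.cmeta["term"] = new_term
            self.wal.append(op)
        except OSError as e:
            if e.errno != errno.ENOSPC:
                # Any other I/O error is still treated as fatal here.
                self.failed = True
                raise
            # Out of space may be transient: roll back and let the caller retry.
            self.applied_ops = saved_ops
            self.cmeta = saved_cmeta
            return False
        return True
```

With a WAL of capacity 1, a second apply() returns False and leaves the
replica's state as it was; once space is freed, the same op can be retried
without the replica ever having been marked failed.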
> Mitigate effects of full disks
> ------------------------------
>
> Key: KUDU-2404
> URL: https://issues.apache.org/jira/browse/KUDU-2404
> Project: Kudu
> Issue Type: Improvement
> Components: fs, tserver
> Reporter: Andrew Wong
> Priority: Major
>
> Currently, if a tablet's data directory group runs out of space during an MRS
> or DMS flush, the operation will fail, and the tserver will crash, as MRS and
> DMS flush failures are fatal without the proper care. For disk failures, this
> "care" meant ensuring that upon failing the op, the tablet has started the
> process of shutting down and being failed so it can be replicated elsewhere.
> No such handling currently exists for full disks, although it wouldn't be
> unreasonable to apply the same or similar steps.