[ 
https://issues.apache.org/jira/browse/KUDU-2404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014620#comment-17014620
 ] 

Andrew Wong commented on KUDU-2404:
-----------------------------------

Thinking about this in the context of the WAL: since out-of-space issues may 
very well be transient, we probably don't need to fail the tablet (which is 
currently a terminal state). I haven't fully thought this through, but a 
transiently full disk seems similar in behavior to a slow WAL disk. That 
said, if a write fails because we've run out of space, we need to be sure that 
the in-memory state and the persistent cmeta state associated with the failed 
operation are rolled back.
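
To make the rollback concern concrete, here is a rough sketch of the pattern, 
not Kudu's actual code. MetaState, WriteResult, OpOutcome, and UpdateMeta are 
all hypothetical names made up for illustration, and the sketch assumes the 
persist step is all-or-nothing (for example, write to a temp file and rename), 
so only the in-memory copy needs restoring on failure.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical stand-in for a replica's consensus metadata; not Kudu's type.
struct MetaState {
  int64_t term = 0;
  std::string voted_for;
};

// Hypothetical result of a persist attempt.
enum class WriteResult { kOk, kNoSpace, kIoError };

// Hypothetical outcome reported to the caller.
enum class OpOutcome { kApplied, kRetryLater, kFailTablet };

// Applies 'next' to the in-memory state and persists it through 'persist'.
// If persisting fails, the in-memory state is restored from a snapshot so it
// never gets ahead of what is on disk. A no-space error is treated as
// transient (retry later); any other error fails the tablet.
OpOutcome UpdateMeta(
    MetaState* mem, const MetaState& next,
    const std::function<WriteResult(const MetaState&)>& persist) {
  const MetaState snapshot = *mem;  // Saved for rollback.
  *mem = next;
  switch (persist(*mem)) {
    case WriteResult::kOk:
      return OpOutcome::kApplied;
    case WriteResult::kNoSpace:
      *mem = snapshot;  // Roll back; the disk may free up later.
      return OpOutcome::kRetryLater;
    case WriteResult::kIoError:
      break;
  }
  *mem = snapshot;  // Roll back before giving up on the tablet.
  return OpOutcome::kFailTablet;
}
```

The point of the snapshot is that a caller who sees kRetryLater can retry the 
same operation once space frees up, without the in-memory view having drifted 
from the on-disk one in the meantime.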

> Mitigate effects of full disks
> ------------------------------
>
>                 Key: KUDU-2404
>                 URL: https://issues.apache.org/jira/browse/KUDU-2404
>             Project: Kudu
>          Issue Type: Improvement
>          Components: fs, tserver
>            Reporter: Andrew Wong
>            Priority: Major
>
> Currently, if a tablet's data directory group runs out of space during an MRS 
> or DMS flush, the operation will fail, and the tserver will crash, as MRS and 
> DMS flush failures are fatal without the proper care. For disk failures, this 
> "care" meant ensuring that upon failing the op, the tablet has started the 
> process of shutting down and being failed so it can be replicated elsewhere. 
> No such handling currently exists for full disks, although it wouldn't be 
> unreasonable to apply the same or similar steps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
