[ 
https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671596#comment-15671596
 ] 

Santhosh Kumar Shanmugham commented on MESOS-6575:
--------------------------------------------------

If the task inside the container cannot make any progress because it has 
exhausted its disk quota, the user is probably going to kill it and restart it 
with a different configuration. We can also argue that not killing the task 
makes it harder for the user to detect tasks that become unhealthy after 
exhausting the disk, and potentially requires changes to metrics and alarms.

We ran into a situation where the container exhausted its disk quota and went 
into an unhealthy state, where even the log message writes were failing due to 
lack of quota.

The {{disk/xfs}} isolator's current behavior would make more sense if the 
container were resizable.

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6575
>                 URL: https://issues.apache.org/jira/browse/MESOS-6575
>             Project: Mesos
>          Issue Type: Task
>          Components: isolation, slave
>            Reporter: Santhosh Kumar Shanmugham
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} 
> protobuf when the executor exceeds its quota, the {{disk/xfs}} isolator, 
> which relies on XFS's internal quota enforcement, lets the {{write}} 
> operation that would exceed the quota limit fail silently, without 
> surfacing the quota breach information.
> This task is to change the {{disk/xfs}} isolator so that a 
> {{ContainerLimitation}} message is triggered when the quota is exceeded. 
> This feature will rely on the underlying filesystem being mounted with 
> {{pqnoenforce}} (accounting-only mode), so that XFS does not fail writes 
> that exceed the quota with an {{EDQUOT}} error. The isolator can then track 
> disk usage via {{xfs_quota}} every {{container_disk_watch_interval}}, much 
> like {{disk/du}} does using {{du}}, and surface the quota-exceeded event via 
> a {{ContainerLimitation}} protobuf, causing the executor to be terminated. 
> The feature can be turned on or off via the existing 
> {{enforce_container_disk_quota}} option.
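
The polling approach in the description above could be sketched roughly as 
follows. This is only an illustration in Python, not Mesos code: the 
Limitation class stands in for the ContainerLimitation protobuf, and 
check_quotas, usage_of and enforce are hypothetical names. How usage is 
actually read (for example from xfs_quota accounting) is left to the caller, 
and the function is assumed to be invoked once per watch interval.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Limitation:
    """Hypothetical stand-in for a ContainerLimitation message."""
    container_id: str
    used_bytes: int
    limit_bytes: int
    message: str


def check_quotas(
    limits: Dict[str, int],
    usage_of: Callable[[str], int],
    enforce: bool = True,
) -> List[Limitation]:
    """Run one polling pass and return a limitation per container over quota.

    limits   maps a container id to its disk quota in bytes.
    usage_of returns the current accounted usage in bytes for a container id
             (assumed to come from accounting-only quota data).
    enforce  mirrors the idea of an on/off switch; when False, nothing is
             reported.
    """
    if not enforce:
        return []

    breaches = []
    for container_id, limit_bytes in sorted(limits.items()):
        used_bytes = usage_of(container_id)
        # Only usage strictly above the limit counts as a breach.
        if used_bytes > limit_bytes:
            breaches.append(
                Limitation(
                    container_id=container_id,
                    used_bytes=used_bytes,
                    limit_bytes=limit_bytes,
                    message=(
                        f"Disk usage ({used_bytes} bytes) exceeds quota "
                        f"({limit_bytes} bytes)"
                    ),
                )
            )
    return breaches
```

A caller would run check_quotas on a timer and terminate the executor of 
each container that appears in the returned list.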



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
