Meng Zhu commented on MESOS-10006:

 Cross-posting from Slack:

Thanks for the ticket! Unfortunately, the log does not contain much useful 
information: we did not print out the slaveID upon check failure. I sent out 
a patch to print more info upon check failure:
https://reviews.apache.org/r/71581
We should consider backporting it.

Also, a hunch at a diagnosis: CHECK failures on sorter function input 
arguments are almost always bugs on the caller side. In this case, the most 
likely cause is a race or inconsistency between the master and the allocator 
during recovery.
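
To illustrate what "print more info upon check failure" buys us, here is a 
minimal sketch. It is not the Mesos sorter code and not the patch in the 
review above; ClientAllocation, checkContains and unallocated are 
hypothetical names, and the map of agent ID to CPUs is an assumed stand-in 
for the real per-agent resource tracking. The point is only that the 
precondition failure names the offending agent, so the caller-side bug is 
easier to trace.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a sorter client's per-agent allocation totals.
// This is NOT the actual Mesos data structure.
struct ClientAllocation {
  std::map<std::string, double> resources;  // agent ID -> allocated CPUs
};

// Returns a diagnostic message if `slaveId` is not tracked for this client,
// or std::nullopt if the precondition holds.
std::optional<std::string> checkContains(
    const ClientAllocation& allocation,
    const std::string& client,
    const std::string& slaveId)
{
  if (allocation.resources.count(slaveId) > 0) {
    return std::nullopt;
  }

  return "Check failed: resources.contains(slaveId): agent '" + slaveId +
         "' is not tracked for client '" + client + "' (" +
         std::to_string(allocation.resources.size()) + " agent(s) tracked)";
}

// Releases `cpus` on `slaveId`. Aborts with the diagnostic above if the
// caller passes an agent that was never allocated to this client, which is
// the caller-side bug the precondition is meant to catch.
void unallocated(
    ClientAllocation& allocation,
    const std::string& client,
    const std::string& slaveId,
    double cpus)
{
  assert(cpus >= 0.0);

  if (auto error = checkContains(allocation, client, slaveId)) {
    std::cerr << *error << std::endl;
    std::abort();
  }

  allocation.resources[slaveId] -= cpus;
}
```

With a message like this, a crash during recovery would at least tell us 
which agent the master and allocator disagree about.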

> Crash in Sorter: "Check failed: resources.contains(slaveId)"
> ------------------------------------------------------------
>                 Key: MESOS-10006
>                 URL: https://issues.apache.org/jira/browse/MESOS-10006
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0, 1.4.1, 1.9.0
>         Environment: Ubuntu Bionic 18.04, Mesos 1.1.0, 1.4.1, 1.9.0 (logs are 
> from 1.9.0).
>            Reporter: Terra Field
>            Priority: Major
>         Attachments: mesos-master.log.gz
> We've hit a similar exception on 3 different versions of the Mesos master 
> (the line #/file name changes but the Check failed is the same), usually when 
> under very high load:
> {noformat}
> F1003 22:06:54.463502  8579 sorter.hpp:339] Check failed: 
> resources.contains(slaveId)
> {noformat}
> This particular occurrence happened after the election of a new master that 
> was then stuck doing framework update broadcasts, as documented in 
> MESOS-10005.

This message was sent by Atlassian Jira
