Arun M J created MESOS-10203:
--------------------------------

             Summary: Agent process crashes on newer Linux kernels if 'linux/capabilities' isolation is enabled
                 Key: MESOS-10203
                 URL: https://issues.apache.org/jira/browse/MESOS-10203
             Project: Mesos
          Issue Type: Bug
          Components: agent
            Reporter: Arun M J

The Mesos agent crashes with the following stack trace on newer Linux kernels (>= 5.8.x) if it is started with MESOS_ISOLATION=linux/capabilities. It runs fine on 5.7.19 but fails on 5.8.18, 5.9.11 and 5.10.

{quote}
{{Dec 13 05:08:28 mesosbox mesos-agent[465]: sh: hadoop: command not found}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: I1213 05:08:28.234824 458 fetcher.cpp:66] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Hadoop client is not available, exit status: 32512}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: Reached unreachable statement at linux/capabilities.cpp:497}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: *** Aborted at 1607836108 (unix time) try "date -d @1607836108" if you are using GNU date ***}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: PC: @ 0x7f875bd62387 __GI_raise}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: *** SIGABRT (@0x1ca) received by PID 458 (TID 0x7f8760ddca00) from PID 458; stack trace: ***}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875c626630 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd62387 __GI_raise}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd63a78 __GI_abort}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875e60f237 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef6e7c1 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef723cc (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ef70c96 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875f05389d (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ed837fc (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ed72332 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875ecf54c6 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x55f5d9c1a256 (unknown)}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x7f875bd4e555 __libc_start_main}}
{{Dec 13 05:08:28 mesosbox mesos-agent[466]: @ 0x55f5d9c1d10f (unknown)}}
{{Dec 13 05:08:28 mesosbox kernel: audit: type=1701 audit(1607836108.250:274): auid=4294967295 uid=0 gid=0 ses=4294967295 subj==unconfined pid=4772 comm="mesos-agent" exe="/usr/sbin/mesos-agent" sig=6 res=1}}
{quote}

Looking further, I found that the abort is raised from [linux/capabilities.cpp|https://github.com/apache/mesos/blob/206da612c0aada0b1d86beb63660d9083b774894/src/linux/capabilities.cpp#L495-L502], in the operator that converts capability enum values to human-readable names.

{code:cpp}
ostream& operator<<(ostream& stream, const Capability& capability)
{
  switch (capability) {
    case CHOWN: return stream << "CHOWN";
    case DAC_OVERRIDE: return stream << "DAC_OVERRIDE";
    case AUDIT_READ: return stream << "AUDIT_READ";
    ...
    ...
    case MAX_CAPABILITY: UNREACHABLE(); // !!! Crash site
  }

  UNREACHABLE();
}
{code}

[MAX_CAPABILITY|https://github.com/apache/mesos/blob/206da612c0aada0b1d86beb63660d9083b774894/src/linux/capabilities.hpp#L75] is defined as *38*, but newer Linux kernels have since introduced capabilities at and above that value:

* *CAP_PERFMON*=38 (since Linux 5.8): employ various performance-monitoring mechanisms
* *CAP_BPF*=39 (since Linux 5.8): employ privileged BPF operations
* *CAP_CHECKPOINT_RESTORE*=40 (since Linux 5.9): allow checkpoint/restore related operations

ref: [https://github.com/torvalds/linux/blob/master/include/uapi/linux/capability.h]

CAP_PERFMON has the same value as MAX_CAPABILITY, which matches the crash site marked above. The Mesos code above does not account for this kind of kernel evolution, so any capability newly added to the kernel will break the isolator.
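
To illustrate the failure mode described in this report, here is a minimal, self-contained sketch of a more tolerant conversion. It is not the Mesos code and not a proposed patch: the enum is a trimmed, hypothetical stand-in for the one in capabilities.hpp, and the helper names {{stringify}} and {{highestKnownCapability}} are invented for this example. It assumes the agent can learn the kernel's highest capability number from somewhere (for instance /proc/sys/kernel/cap_last_cap) and shows two defensive options: printing unknown values instead of aborting, and clamping the kernel's range to what the build knows.

```cpp
#include <algorithm>
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical, trimmed-down stand-in for the Capability enum;
// only a few of the real entries are reproduced here.
enum Capability : int {
  CHOWN = 0,
  DAC_OVERRIDE = 1,
  AUDIT_READ = 37,
  MAX_CAPABILITY = 38
};

// Prints known capabilities by name. Any other value, including
// MAX_CAPABILITY itself, gets a numeric placeholder instead of an abort.
std::ostream& operator<<(std::ostream& stream, const Capability& capability)
{
  switch (capability) {
    case CHOWN:          return stream << "CHOWN";
    case DAC_OVERRIDE:   return stream << "DAC_OVERRIDE";
    case AUDIT_READ:     return stream << "AUDIT_READ";
    case MAX_CAPABILITY: break;
  }
  return stream << "UNKNOWN(" << static_cast<int>(capability) << ")";
}

// Illustrative helper: renders a capability through operator<<.
std::string stringify(Capability capability)
{
  std::ostringstream out;
  out << capability;
  return out.str();
}

// Illustrative helper: given the kernel's highest capability number
// (an assumed input), returns the highest number this build can name.
int highestKnownCapability(int kernelLastCapability)
{
  assert(kernelLastCapability >= 0);
  return std::min(kernelLastCapability, static_cast<int>(MAX_CAPABILITY) - 1);
}
```

With this sketch, {{stringify(static_cast<Capability>(39))}} yields {{UNKNOWN(39)}} rather than terminating the process, and {{highestKnownCapability(40)}} yields 37, so a loop over the kernel's capabilities would stop at the last value the enum actually names. Whether silently ignoring newer capabilities is acceptable for the isolator is a separate design question that this example does not settle.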