[jira] [Commented] (MESOS-2002) Module loading within frameworks
[ https://issues.apache.org/jira/browse/MESOS-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14204763#comment-14204763 ] Till Toenshoff commented on MESOS-2002: --- https://reviews.apache.org/r/27806/ Module loading within frameworks - Key: MESOS-2002 URL: https://issues.apache.org/jira/browse/MESOS-2002 Project: Mesos Issue Type: Improvement Components: framework, modules Reporter: Till Toenshoff Assignee: Till Toenshoff Priority: Blocker Frameworks should be granted the capability to load modules. h4.Motivation Allowing a modularized Authenticatee to cover framework authentication against the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-809) External control of the ip that Mesos components publish to zookeeper
[ https://issues.apache.org/jira/browse/MESOS-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204931#comment-14204931 ] Jay Buffington commented on MESOS-809: -- When we fix this we should update the master/http.cpp code which has this:
{noformat}
const string Master::Http::REDIRECT_HELP = HELP(
    TLDR(
        "Redirects to the leading Master."),
    USAGE(
        "/master/redirect"),
    DESCRIPTION(
        "This returns a 307 Temporary Redirect to the leading Master.",
        "If no Master is leading (according to this Master), then the",
        "Master will redirect to itself.",
        "",
        "**NOTES:**",
        "1. This is the recommended way to bookmark the WebUI when",
        "running multiple Masters.",
        "2. This is broken currently \"on the cloud\" (e.g. EC2) as",
        "this will attempt to redirect to the private IP address."));
{noformat}
Fixing this issue should resolve #2 in the notes. External control of the ip that Mesos components publish to zookeeper - Key: MESOS-809 URL: https://issues.apache.org/jira/browse/MESOS-809 Project: Mesos Issue Type: Improvement Components: framework, master, slave Affects Versions: 0.14.2 Reporter: Khalid Goudeaux Priority: Minor With tools like Docker making containers more manageable, it's tempting to use containers for all software installation. The CoreOS project is an example of this. When an application is run inside a container it sees a different ip/hostname from the host system running the container. That ip is only valid from inside that host, no other machine can see it. From inside a container, the Mesos master and slave publish that private ip to zookeeper and as a result they can't find each other if they're on different machines. The --ip option can't help because the public ip isn't available for binding from within a container. Essentially, from inside the container, mesos processes don't know the ip they're available at (they may not know the port either). It would be nice to bootstrap the processes with the correct ip for them to publish to zookeeper.
[jira] [Created] (MESOS-2056) Refactor fetcher code in preparation for fetcher cache
Bernd Mathiske created MESOS-2056: - Summary: Refactor fetcher code in preparation for fetcher cache Key: MESOS-2056 URL: https://issues.apache.org/jira/browse/MESOS-2056 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Priority: Minor Refactor/rearrange fetcher-related code so that cache functionality can be dropped in. One could do both together in one go. This is splitting up reviews into smaller chunks. It will not immediately be obvious how this change will be used later, but it will look better-factored and still do the exact same thing as before. In particular, a download routine to be reused several times in launcher/fetcher will be factored out and the remainder of fetcher-related code can be moved from the containerizer realm into fetcher.cpp. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher
Bernd Mathiske created MESOS-2057: - Summary: Add cache functionality with concurrent downloading to fetcher Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-2057: -- Description: Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. Add cache functionality with concurrent downloading to fetcher -- Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-2057: -- Description: Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. Different URIs from different CommandInfos can be downloaded concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. was: Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. Add cache functionality with concurrent downloading to fetcher -- Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. Different URIs from different CommandInfos can be downloaded concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. 
[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bernd Mathiske updated MESOS-2057: -- Description: Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. Different URIs from different CommandInfos can be downloaded concurrently. No cache eviction, cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. was: Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. Different URIs from different CommandInfos can be downloaded concurrently. No cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056. Add cache functionality with concurrent downloading to fetcher -- Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Add a URI flag to CommandInfo messages that indicates caching. When a URI is cached it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. Different URIs from different CommandInfos can be downloaded concurrently. No cache eviction, cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. 
(So don't use this feature in production until the whole epic is complete.) Depends on MESOS-2056.
[jira] [Commented] (MESOS-1919) Create IP address abstraction
[ https://issues.apache.org/jira/browse/MESOS-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205097#comment-14205097 ] Dominic Hamon commented on MESOS-1919: -- I think trying to bind on both seems reasonable. The abstraction for connecting should also try both, probably using [happy eyeballs|http://en.wikipedia.org/wiki/Happy_Eyeballs]. libprocess should almost certainly be extended to use a socket address instead of uint32_t _ip_ to store it. Create IP address abstraction - Key: MESOS-1919 URL: https://issues.apache.org/jira/browse/MESOS-1919 Project: Mesos Issue Type: Task Components: libprocess Reporter: Dominic Hamon Assignee: Evelina Dumitrescu Priority: Minor In the code, many functions need only the IP address to be passed as a parameter. I don't think it would be desirable to use a struct SockaddrStorage (MESOS-1916). Consider using a {{std::vector<unsigned char>}} (see {{typedef std::vector<unsigned char> IPAddressNumber;}} in the Chromium project).
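A minimal sketch of the byte-vector idea from the description, assuming POSIX inet_pton is available. The helper name parseIP is hypothetical and is not part of libprocess or Chromium; only the IPAddressNumber typedef comes from the ticket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>

#include <string>
#include <vector>

// Family-agnostic address bytes in network byte order:
// 4 bytes for IPv4, 16 bytes for IPv6 (the typedef quoted in the ticket).
typedef std::vector<unsigned char> IPAddressNumber;

// Hypothetical helper: parses a textual address into raw bytes.
// Returns an empty vector if the text is neither valid IPv4 nor IPv6.
IPAddressNumber parseIP(const std::string& text)
{
  unsigned char buffer[sizeof(struct in6_addr)];

  if (inet_pton(AF_INET, text.c_str(), buffer) == 1) {
    return IPAddressNumber(buffer, buffer + sizeof(struct in_addr));
  }

  if (inet_pton(AF_INET6, text.c_str(), buffer) == 1) {
    return IPAddressNumber(buffer, buffer + sizeof(struct in6_addr));
  }

  return IPAddressNumber();
}
```

With this shape a function that needs only the address can take an IPAddressNumber and tell the family from its size, without carrying a port or a full socket address.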
[jira] [Updated] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
[ https://issues.apache.org/jira/browse/MESOS-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1876: - Labels: newbie twitter (was: newbie) Remove deprecated 'slave_id' field in ReregisterSlaveMessage. - Key: MESOS-1876 URL: https://issues.apache.org/jira/browse/MESOS-1876 Project: Mesos Issue Type: Task Components: technical debt Reporter: Benjamin Mahler Priority: Trivial Labels: newbie, twitter This is to follow through on removing the deprecated field that we've been phasing out. In 0.21.0, this field will no longer be read:
{code}
message ReregisterSlaveMessage {
  // TODO(bmahler): slave_id is deprecated.
  // 0.21.0: Now an optional field. Always written, never read.
  // 0.22.0: Remove this field.
  optional SlaveID slave_id = 1;
  required SlaveInfo slave = 2;
  repeated ExecutorInfo executor_infos = 4;
  repeated Task tasks = 3;
  repeated Archive.Framework completed_frameworks = 5;
}
{code}
[jira] [Updated] (MESOS-1867) Precision errors in UI
[ https://issues.apache.org/jira/browse/MESOS-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1867: - Labels: easyfix newbie (was: easyfix) Precision errors in UI -- Key: MESOS-1867 URL: https://issues.apache.org/jira/browse/MESOS-1867 Project: Mesos Issue Type: Bug Components: master Affects Versions: 0.20.0 Environment: mesos 0.20.0, 3 masters, 25 slaves Reporter: Ian Babrou Priority: Trivial Labels: easyfix, newbie Just look at the image: http://i.imgur.com/oFx1M7B.png I have ~2500 completed tasks from Chronos, 256mb and 0.1 cpu each. At least UI should be fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1753) Allow default/deleted functions
[ https://issues.apache.org/jira/browse/MESOS-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1753: - Labels: c++11 twitter (was: c++11) Allow default/deleted functions --- Key: MESOS-1753 URL: https://issues.apache.org/jira/browse/MESOS-1753 Project: Mesos Issue Type: Improvement Reporter: Dominic Hamon Priority: Minor Labels: c++11, twitter Add default/delete functions to the configure script. Once there, we can start using them across the code-base. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1708) Using the wrong resource name should report a better error.
[ https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1708: - Labels: newbie twitter (was: newbie) Using the wrong resource name should report a better error. - Key: MESOS-1708 URL: https://issues.apache.org/jira/browse/MESOS-1708 Project: Mesos Issue Type: Bug Components: framework, master Reporter: Benjamin Hindman Labels: newbie, twitter If a scheduler launches a task using resources the master doesn't know about, the task validator causes the task to fail, but the error message is not very helpful.
[jira] [Assigned] (MESOS-1622) Move implementations from .hpp to .cpp
[ https://issues.apache.org/jira/browse/MESOS-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1622: Assignee: Dominic Hamon Move implementations from .hpp to .cpp -- Key: MESOS-1622 URL: https://issues.apache.org/jira/browse/MESOS-1622 Project: Mesos Issue Type: Story Components: technical debt Reporter: Isabel Jimenez Assignee: Dominic Hamon Priority: Minor Labels: newbie This issue is related to MESOS-1582. Some headers have unnecessary inline definitions and function declarations; to speed up build time we are lightening headers. This issue will not apply to stout.
[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.
[ https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1517: - Labels: reliability twitter (was: reliability) Maintain a queue of messages that arrive before the master recovers. Key: MESOS-1517 URL: https://issues.apache.org/jira/browse/MESOS-1517 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Labels: reliability, twitter Currently when the master is recovering, we drop all incoming messages. If slaves and frameworks knew about the leading master only once it has recovered, then we would only expect to see messages after we've recovered. We previously considered enqueuing all messages through the recovery future, but this has the downside of forcing all messages to go through the master's queue twice:
{code}
// TODO(bmahler): Consider instead re-enqueuing *all* messages
// through recover(). What are the performance implications of
// the additional queueing delay and the accumulated backlog
// of messages post-recovery?
if (!recovered.get().isReady()) {
  VLOG(1) << "Dropping '" << event.message->name << "' message since "
          << "not recovered yet";
  ++metrics.dropped_messages;
  return;
}
{code}
However, an easy solution to this problem is to maintain an explicit queue of incoming messages that gets flushed once we finish recovery. This ensures that all messages post-recovery are processed normally.
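The explicit queue proposed in MESOS-1517 could look roughly like the sketch below. It is an illustration only: RecoveringReceiver and all of its members are hypothetical names, messages are reduced to strings, and none of this is taken from the Mesos master.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for the master's message handling; not Mesos code.
class RecoveringReceiver
{
public:
  explicit RecoveringReceiver(std::function<void(const std::string&)> handler)
    : handler_(std::move(handler)), recovered_(false) {}

  // Buffers the message while recovering; handles it directly afterwards.
  void receive(const std::string& message)
  {
    if (!recovered_) {
      pending_.push(message);
      return;
    }
    handler_(message);
  }

  // Marks recovery as finished and flushes buffered messages in
  // arrival order.
  void recovered()
  {
    assert(!recovered_); // Recovery is expected to complete only once.
    recovered_ = true;
    while (!pending_.empty()) {
      handler_(pending_.front());
      pending_.pop();
    }
  }

  std::size_t pending() const { return pending_.size(); }

private:
  std::function<void(const std::string&)> handler_;
  bool recovered_;
  std::queue<std::string> pending_;
};
```

Messages that arrive after recovery bypass the queue entirely, so only the pre-recovery backlog pays the extra hop. The sketch leaves the queue unbounded; a real implementation would probably want a cap or a metric on its size.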
[jira] [Updated] (MESOS-1424) Mesos tests should not rely on echo
[ https://issues.apache.org/jira/browse/MESOS-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1424: - Labels: cleanup easyfix newbie osx tests (was: cleanup easyfix osx tests) Mesos tests should not rely on echo --- Key: MESOS-1424 URL: https://issues.apache.org/jira/browse/MESOS-1424 Project: Mesos Issue Type: Improvement Components: test Reporter: Till Toenshoff Priority: Minor Labels: cleanup, easyfix, newbie, osx, tests Triggered by MESOS-1413 I would like to propose changing our tests to not rely on {{echo}} but to use {{printf}} instead. This seems to be useful as {{echo}} is introducing an extra linefeed after the supplied string whereas {{printf}} does not. The {{-n}} switch preventing that extra linefeed is unfortunately not portable - it is not supported by the builtin {{echo}} of the BSD / OSX {{/bin/sh}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1355) Use of untrusted string value in jvm.cpp
[ https://issues.apache.org/jira/browse/MESOS-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1355: - Labels: coverity security (was: coverity) Use of untrusted string value in jvm.cpp Key: MESOS-1355 URL: https://issues.apache.org/jira/browse/MESOS-1355 Project: Mesos Issue Type: Bug Reporter: Niklas Quarfot Nielsen Labels: coverity, security
*** CID 1213892: Use of untrusted string value (TAINTED_STRING)
/src/jvm/jvm.cpp: 66 in Jvm::create(const std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char>>>> &, JNI::Version, bool)()
60   std::string libJvmPath = os::getenv("JAVA_JVM_LIBRARY", false);
61
62   if (libJvmPath.empty()) {
63     libJvmPath = mesos::internal::build::JAVA_JVM_LIBRARY;
64   }
65
CID 1213892: Use of untrusted string value (TAINTED_STRING)
Passing tainted string libJvmPath.c_str() to dlopen(char const *, int), which cannot accept tainted data.
66   void* handle = dlopen(libJvmPath.c_str(), RTLD_NOW);
67
68   if (handle == NULL) {
69     return Error(dlerror());
70   }
71
[jira] [Updated] (MESOS-1353) Operands don't affect result in mesos_containerizer.cpp
[ https://issues.apache.org/jira/browse/MESOS-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1353: - Labels: coverity newbie (was: coverity) Operands don't affect result in mesos_containerizer.cpp --- Key: MESOS-1353 URL: https://issues.apache.org/jira/browse/MESOS-1353 Project: Mesos Issue Type: Bug Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: coverity, newbie May be a false positive - should be investigated.
*** CID 1213887: Operands don't affect result (CONSTANT_EXPRESSION_RESULT)
/src/slave/containerizer/mesos_containerizer.cpp: 416 in mesos::internal::slave::execute(const mesos::CommandInfo &, const std::basic_string<char, std::char_traits<char>, std::allocator<char>> &, const Option<std::basic_string<char, std::char_traits<char>, std::allocator<char>>> &, const std::map<std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::less<std::basic_string<char, std::char_traits<char>, std::allocator<char>>>, std::allocator<std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::basic_string<char, std::char_traits<char>, std::allocator<char>>>>> &, bool, int, int, const std::list<Option<mesos::CommandInfo>, std::allocator<Option<mesos::CommandInfo>>> &)()
410     if (chown.isError()) {
411       ABORT("Failed to chown work directory");
412     }
413   }
414
415   // Enter working directory.
CID 1213887: Operands don't affect result (CONSTANT_EXPRESSION_RESULT)
"chdir(directory) < 0" is always false regardless of the values of its operands. This occurs as the logical operand of if.
416   if (os::chdir(directory) < 0) {
417     ABORT("Failed to chdir into work directory");
418   }
419
420   // Redirect output to files in working dir if required. We append because
421   // others (e.g., mesos-fetcher) may have already logged to the files.
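One plausible reading of the Coverity finding above is that the chdir wrapper reports success as a bool, so comparing its result with `< 0` can never be true and a failed chdir goes unnoticed. The sketch below illustrates that reading only; changeDirectory and enterWorkDirectory are hypothetical names, and the assumption about the wrapper's return type is not confirmed by the ticket.

```cpp
#include <unistd.h>

#include <string>

// Hypothetical wrapper that reports success as a bool. With this kind of
// signature, `changeDirectory(dir) < 0` is a constant false expression,
// because a bool converts to 0 or 1 and never to a negative value.
bool changeDirectory(const std::string& directory)
{
  return ::chdir(directory.c_str()) == 0;
}

// Tests the bool directly, so a failed chdir is actually detected.
// Returns true if the working directory was entered.
bool enterWorkDirectory(const std::string& directory)
{
  if (!changeDirectory(directory)) {
    return false; // The original code would call ABORT(...) here.
  }
  return true;
}
```

If the wrapper instead returned the raw int from ::chdir, the original `< 0` check would be correct and the report would indeed be a false positive, which is why the ticket asks for investigation.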
[jira] [Updated] (MESOS-1194) protobuf-JSON rendering doesnt validate
[ https://issues.apache.org/jira/browse/MESOS-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1194: - Labels: json newbie protobuf stout (was: json protobuf stout) protobuf-JSON rendering doesnt validate --- Key: MESOS-1194 URL: https://issues.apache.org/jira/browse/MESOS-1194 Project: Mesos Issue Type: Bug Components: stout Affects Versions: 0.19.0 Reporter: Till Toenshoff Priority: Minor Labels: json, newbie, protobuf, stout When using JSON::Protobuf(Message), the supplied protobuf is not checked for being properly initialized, hence e.g. required fields could be missing. It would be desirable to have a feedback mechanism in place for this constructor - maybe this would do:
{noformat}
if (!message.IsInitialized()) {
  std::cerr << "Protobuf not initialized: "
            << message.InitializationErrorString() << std::endl;
  abort();
}
{noformat}
[jira] [Closed] (MESOS-932) building spark0.8.1 with mesos 0.15.0 error becauseof protobuf is not compatible
[ https://issues.apache.org/jira/browse/MESOS-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon closed MESOS-932. --- Resolution: Fixed building spark0.8.1 with mesos 0.15.0 error becauseof protobuf is not compatible Key: MESOS-932 URL: https://issues.apache.org/jira/browse/MESOS-932 Project: Mesos Issue Type: Bug Components: libprocess Affects Versions: 0.15.0 Environment: hadoop2.2.0+spark0.8.0+mesos0.15.0 Reporter: liuhanbing Labels: test Original Estimate: 96h Remaining Estimate: 96h I've tried building spark0.8.1 with mesos 0.15.0 and I have the exact error: Stack: [0x7f4cd1734000,0x7f4cd1835000], sp=0x7f4cd1833490, free space=1021k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x514706] unsigned+0xb6 j org.apache.mesos.MesosSchedulerDriver.initialize()V+0 j org.apache.mesos.MesosSchedulerDriver.init(Lorg/apache/mesos/Scheduler;Lorg/apache/mesos/Protos$FrameworkInfo;Ljava/lang/String;)V+62 j org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend$$anon$1.run()V+44 v ~StubRoutines::call_stub Java Threads: ( = current thread ) =0x7f4cd4186800 JavaThread MesosSchedulerBackend driver daemon [_thread_in_vm, id=19685, stack(0x7f4cd1734000,0x7f4cd1835000)] 0x7f4cd4170800 JavaThread Timer-0 daemon [_thread_blocked, id=19684, stack(0x7f4cd1835000,0x7f4cd1936000)] 0x7f4cd4077000 JavaThread qtp1645300503-49 daemon [_thread_blocked, id=19683, stack(0x7f4cd33f4000,0x7f4cd34f5000)] 0x7f4cd4075000 JavaThread qtp1645300503-48 daemon [_thread_blocked, id=19682, stack(0x7f4cd34f5000,0x7f4cd35f6000)] It looks like the protobuf versions are not compatible: spark0.8.1 has protobuf2.5.0 but mesos0.15.0 still has protobuf2.4.1. I've rebuilt mesos 0.15.0 using protobuf 2.5.0. The error I am having still seems to be related to protobuf, and I seriously do not know how to debug it. Any ideas?
[jira] [Commented] (MESOS-719) missing-call-to-setgroups
[ https://issues.apache.org/jira/browse/MESOS-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205217#comment-14205217 ] Dominic Hamon commented on MESOS-719: - [~Ken Simon] [~tstclair] Can we get this patch uploaded to RB please? missing-call-to-setgroups - Key: MESOS-719 URL: https://issues.apache.org/jira/browse/MESOS-719 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.15.0 Reporter: Timothy St. Clair Labels: newbie Attachments: MESOS-719-0.20.1.patch This traces into stout/os.hpp. In vetting the code as part of Fedora packaging, rpmlint outputs an error around privilege changing: mesos.x86_64: E: missing-call-to-setgroups /usr/lib64/libmesos-0.15.0.so.0.0.0 https://www.securecoding.cert.org/confluence/display/seccode/POS36-C.+Observe+correct+revocation+order+while+relinquishing+privileges
[jira] [Updated] (MESOS-698) Implement requestResources() feature in the master
[ https://issues.apache.org/jira/browse/MESOS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-698: Assignee: Vinod Kone Implement requestResources() feature in the master -- Key: MESOS-698 URL: https://issues.apache.org/jira/browse/MESOS-698 Project: Mesos Issue Type: Improvement Components: framework, master, slave Affects Versions: 0.12.0 Reporter: Sam Taha Assignee: Vinod Kone Labels: features Scheduler.requestResources() does not remove filters on slaves. Looking at the code it doesn't look like the requestResources() feature is implemented in the master. It is currently a no-op on the master/allocator. The Request protobuf does have an optional slaveId which can be used to perhaps remove any filters on the slave to allow offers to flow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-713) Support for adding subsystems to existing cgroup hierarchies.
[ https://issues.apache.org/jira/browse/MESOS-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-713: Labels: newbie twitter (was: newbie) Support for adding subsystems to existing cgroup hierarchies. - Key: MESOS-713 URL: https://issues.apache.org/jira/browse/MESOS-713 Project: Mesos Issue Type: Improvement Components: isolation Reporter: Benjamin Mahler Priority: Minor Labels: newbie, twitter Currently if a slave is restarted with additional subsystems, it will refuse to proceed if those subsystems are not attached to the existing hierarchy. It's possible to add subsystems to existing hierarchies via re-mounting: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Attaching_Subsystems_to_and_Detaching_Them_From_an_Existing_Hierarchy.html We can add support for this by calling mount with the MS_REMOUNT option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-391: Labels: newbie twitter (was: newbie) Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Labels: newbie, twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes:
F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
*** Check failure stack trace: ***
@ 0x7f9320f82f9d google::LogMessage::Fail()
@ 0x7f9320f88c07 google::LogMessage::SendToLog()
@ 0x7f9320f8484c google::LogMessage::Flush()
@ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f9320c70312 _CheckSome::~_CheckSome()
@ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory()
@ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor()
@ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask()
@ 0x7f9320c9cb43 ProtobufProcess::handler4()
@ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke()
@ 0x7f9320c9d1ab ProtobufProcess::visit()
@ 0x7f9320e4c774 process::MessageEvent::visit()
@ 0x7f9320e40a1d process::ProcessManager::resume()
@ 0x7f9320e41268 process::schedule()
@ 0x7f932055973d start_thread
@ 0x7f931ef3df6d clone
The fix here is to take into account the number of links (st_nlink), when determining whether we need to GC.
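A rough sketch of the link-count check suggested in MESOS-391, using POSIX stat(). The function names, the threshold parameter, and the policy itself are hypothetical and only illustrate how st_nlink could feed a GC decision; they are not the slave's actual logic.

```cpp
#include <sys/stat.h>
#include <sys/types.h>

#include <string>

// Returns the hard-link count of `path`, or -1 if stat() fails.
// On many filesystems a directory's count grows with each subdirectory.
long linkCount(const std::string& path)
{
  struct stat s;
  if (::stat(path.c_str(), &s) != 0) {
    return -1;
  }
  return static_cast<long>(s.st_nlink);
}

// Hypothetical policy: ask for early GC once the link count reaches a
// given fraction of the limit. `limit` would come from LINK_MAX or
// pathconf(path, _PC_LINK_MAX) in a real implementation.
bool shouldGC(long links, long limit, double threshold)
{
  if (links < 0 || limit <= 0) {
    return false; // Unknown count or limit: make no decision here.
  }
  return links >= static_cast<long>(limit * threshold);
}
```

Calling shouldGC(linkCount(executorsDirectory), limit, 0.9) before creating a new run directory would let the collector act before mkdir fails with "Too many links"; executorsDirectory is a placeholder for whichever parent directory is close to its limit.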
[jira] [Updated] (MESOS-723) Expose total number of resources allocated to the slave in its endpoint
[ https://issues.apache.org/jira/browse/MESOS-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-723: - Fix Version/s: (was: 0.22.0) 0.21.0 Expose total number of resources allocated to the slave in its endpoint --- Key: MESOS-723 URL: https://issues.apache.org/jira/browse/MESOS-723 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Assignee: Vinod Kone Labels: twitter Fix For: 0.21.0 This could be useful information if there are bugs in master/slave that causes slaves to overcommit its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2052) RunState::recover should always recover 'completed'
[ https://issues.apache.org/jira/browse/MESOS-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-2052: -- Fix Version/s: 0.21.0 RunState::recover should always recover 'completed' --- Key: MESOS-2052 URL: https://issues.apache.org/jira/browse/MESOS-2052 Project: Mesos Issue Type: Bug Components: containerization, slave Affects Versions: 0.20.0 Reporter: Ian Downes Assignee: Ian Downes Fix For: 0.21.0 RunState::recover() will return partial state if it cannot find or open the libprocess pid file. Specifically, it does not recover the 'completed' flag. However, if the slave has removed the executor (because launch failed or the executor failed to register) the sentinel flag will be set and this fact should be recovered. This ensures that container recovery is not attempted later. This was discovered when the LinuxLauncher failed to recover because it was asked to recover two containers with the same forkedPid. Investigation showed the executors both OOM'ed before registering, i.e., no libprocess pid file was present. However, the containerizer had detected the OOM, destroyed the container, and notified the slave which cleaned everything up: failing the task and calling removeExecutor (which writes the completed sentinel file.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen updated MESOS-2046: Summary: Configure should check headers and libraries for svn and apr (was: Configure should check headers instead of libraries for svn and apr) Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen For dependencies whose headers we include, we need to check for those headers, which in configure means calling AC_CHECK_HEADER instead of AC_CHECK_LIB.
[jira] [Commented] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205254#comment-14205254 ] Timothy Chen commented on MESOS-2046:
commit 42458bf2ad7fde97b18b167ba79b67effe43b0e9 Author: Timothy Chen tnac...@gmail.com Date: Thu Nov 6 14:01:33 2014 -0800 Added check for apr and svn headers besides libraries Review: https://reviews.apache.org/r/27701
commit f8c89e8eda098be9c9a592e53edadc457796d3de Author: Timothy Chen tnac...@gmail.com Date: Thu Nov 6 14:01:43 2014 -0800 Added check for apr and svn headers besides libraries in 3rdparty Review: https://reviews.apache.org/r/27704
Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen For dependencies whose headers we include, configure needs to check that the headers themselves are present, which means calling AC_CHECK_HEADER rather than relying on AC_CHECK_LIB alone.
[jira] [Closed] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen closed MESOS-2046. --- Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen For dependencies whose headers we include, configure needs to check that the headers themselves are present, which means calling AC_CHECK_HEADER rather than relying on AC_CHECK_LIB alone.
[jira] [Reopened] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes reopened MESOS-2046: --- Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen Fix For: 0.21.0 For dependencies whose headers we include, configure needs to check that the headers themselves are present, which means calling AC_CHECK_HEADER rather than relying on AC_CHECK_LIB alone.
[jira] [Updated] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-2046: -- Fix Version/s: 0.21.0 Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen Fix For: 0.21.0 For dependencies whose headers we include, configure needs to check that the headers themselves are present, which means calling AC_CHECK_HEADER rather than relying on AC_CHECK_LIB alone.
[jira] [Resolved] (MESOS-2046) Configure should check headers and libraries for svn and apr
[ https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes resolved MESOS-2046. --- Resolution: Fixed Configure should check headers and libraries for svn and apr Key: MESOS-2046 URL: https://issues.apache.org/jira/browse/MESOS-2046 Project: Mesos Issue Type: Bug Reporter: Timothy Chen Assignee: Timothy Chen Fix For: 0.21.0 For dependencies whose headers we include, configure needs to check that the headers themselves are present, which means calling AC_CHECK_HEADER rather than relying on AC_CHECK_LIB alone.
[jira] [Commented] (MESOS-1694) Future::failure should return a const string
[ https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205289#comment-14205289 ] Dominic Hamon commented on MESOS-1694: -- https://reviews.apache.org/r/27826 Future::failure should return a const string - Key: MESOS-1694 URL: https://issues.apache.org/jira/browse/MESOS-1694 Project: Mesos Issue Type: Task Components: technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave
Dominic Hamon created MESOS-2058: Summary: Deprecate stats.json endpoints for Master and Slave Key: MESOS-2058 URL: https://issues.apache.org/jira/browse/MESOS-2058 Project: Mesos Issue Type: Task Components: master, slave Reporter: Dominic Hamon With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics are now duplicated in the Master and Slave between this and {{stats.json}}. We should deprecate the {{stats.json}} endpoints. Manual inspection of {{stats.json}} shows that all metrics are now covered by the new endpoint for Master and Slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2039) Create a Design Doc for Dynamic Reservations
[ https://issues.apache.org/jira/browse/MESOS-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2039: Description: A design doc to be shared with the community for the Dynamic Reservation epic. Update: Shared [dynamic reservation|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing] with the dev mailing list. was: A design doc to be shared with the community for the Dynamic Reservation epic. Update: Shared [Design Doc|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing] with the dev mailing list. Create a Design Doc for Dynamic Reservations Key: MESOS-2039 URL: https://issues.apache.org/jira/browse/MESOS-2039 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Michael Park Assignee: Michael Park A design doc to be shared with the community for the Dynamic Reservation epic. Update: Shared [dynamic reservation|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing] with the dev mailing list. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2059) improve performance of expensive tests
Dominic Hamon created MESOS-2059: Summary: improve performance of expensive tests Key: MESOS-2059 URL: https://issues.apache.org/jira/browse/MESOS-2059 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Dominic Hamon Priority: Minor
Many of our tests take a long time to run, which slows the developer compile-test cycle. Improving the performance of the worst cases can lead to a significant improvement in developer workflow. A quick test shows that focusing on a few key test fixtures might be worthwhile:
{noformat}
$ egrep '\(.* ms\)$' test.log | cut -d' ' -f10- | cut -d' ' -f1-2 | sed 's/(//' | sort -k2 -rn | head -n 30
ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors 15107
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireMasterZKSession 13473
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster 13434
ZooKeeperMasterContenderDetectorTest.MasterContenders 10089
ZooKeeperMasterContenderDetectorTest.MasterDetectorTimedoutSession 10081
ZooKeeperMasterContenderDetectorTest.ContenderDetectorShutdownNetwork 8459
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSession 8424
ZooKeeperMasterContenderDetectorTest.MasterContender 8397
SlaveRecoveryTest/0.MultipleFrameworks 7971
ExamplesTest.PythonFramework 7326
HealthCheckTest.GracePeriod 6552
SlaveRecoveryTest/0.ReconcileKillTask 6150
ExamplesTest.LowLevelSchedulerPthread 6113
ExamplesTest.JavaFramework 5543
ExamplesTest.NoExecutorFramework 5391
ExamplesTest.TestFramework 5282
ExamplesTest.LowLevelSchedulerLibprocess 5282
ExamplesTest.JavaException 5177
ZooKeeperMasterContenderDetectorTest.ContenderPendingElection 5046
BasicMasterContenderDetectorTest.Detector 5010
BasicMasterContenderDetectorTest.Contender 5004
SlaveRecoveryTest/0.MultipleSlaves 4845
SlaveRecoveryTest/0.MasterFailover 4736
SlaveRecoveryTest/0.ShutdownSlave 4517
SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1 4482
SlaveRecoveryTest/0.Reboot 4481
SlaveRecoveryTest/0.KillTask 3600
SlaveRecoveryTest/0.SchedulerFailover 3542
SlaveRecoveryTest/0.ReconcileShutdownFramework 3534
GroupTest.GroupJoinWithDisconnect 3384
{noformat}
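The same ranking can be reproduced without the shell pipeline. The sketch below assumes the log contains standard gtest result lines of the form `[       OK ] Fixture.Test (1234 ms)`; the helper names `slowest_tests` and `time_by_fixture` are made up for this example. Summing per fixture is an addition that is not in the ticket, included because the ticket suggests focusing on a few key fixtures.

```python
import re

# Matches gtest result lines such as:
#   [       OK ] GroupTest.GroupJoinWithDisconnect (3384 ms)
# Summary lines like "[----------] 3 tests from X (100 ms total)" do not match.
RESULT_LINE = re.compile(r"^\[\s+(?:OK|FAILED)\s+\]\s+(\S+)\s+\((\d+) ms\)$")


def slowest_tests(log_lines, limit=30):
    """Return up to `limit` (test name, milliseconds) pairs, slowest first."""
    timings = []
    for line in log_lines:
        match = RESULT_LINE.match(line.rstrip())
        if match:
            timings.append((match.group(1), int(match.group(2))))
    # sorted() is stable, so ties keep their order of appearance in the log.
    return sorted(timings, key=lambda item: item[1], reverse=True)[:limit]


def time_by_fixture(timings):
    """Sum milliseconds per fixture (the part of the name before the first '.')."""
    totals = {}
    for name, millis in timings:
        fixture = name.split(".", 1)[0]
        totals[fixture] = totals.get(fixture, 0) + millis
    return totals
```

Calling `slowest_tests(open("test.log"))` would give the list above, and passing its result to `time_by_fixture` shows which fixtures dominate the total.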
[jira] [Commented] (MESOS-1694) Future::failure should return a const string
[ https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205461#comment-14205461 ] Dominic Hamon commented on MESOS-1694: -- https://reviews.apache.org/r/27831/ introduces ABORT if string is not set which should simplify this patch. Future::failure should return a const string - Key: MESOS-1694 URL: https://issues.apache.org/jira/browse/MESOS-1694 Project: Mesos Issue Type: Task Components: technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1694) Future::failure should return a const string
[ https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1694: - Sprint: Twitter Mesos Q4 Sprint 3 Future::failure should return a const string - Key: MESOS-1694 URL: https://issues.apache.org/jira/browse/MESOS-1694 Project: Mesos Issue Type: Task Components: technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Priority: Minor Labels: newbie -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2059) improve performance of expensive tests
[ https://issues.apache.org/jira/browse/MESOS-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205489#comment-14205489 ] Benjamin Mahler commented on MESOS-2059: Linking in MESOS-1757 (an epic for speeding up tests), where I included a few starting suggestions and was hoping to have them broken off under the epic.
improve performance of expensive tests -- Key: MESOS-2059 URL: https://issues.apache.org/jira/browse/MESOS-2059 Project: Mesos Issue Type: Improvement Components: technical debt, test Reporter: Dominic Hamon Priority: Minor
Many of our tests take a long time to run, which slows the developer compile-test cycle. Improving the performance of the worst cases can lead to a significant improvement in developer workflow. A quick test shows that focusing on a few key test fixtures might be worthwhile:
{noformat}
$ egrep '\(.* ms\)$' test.log | cut -d' ' -f10- | cut -d' ' -f1-2 | sed 's/(//' | sort -k2 -rn | head -n 30
ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors 15107
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireMasterZKSession 13473
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster 13434
ZooKeeperMasterContenderDetectorTest.MasterContenders 10089
ZooKeeperMasterContenderDetectorTest.MasterDetectorTimedoutSession 10081
ZooKeeperMasterContenderDetectorTest.ContenderDetectorShutdownNetwork 8459
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSession 8424
ZooKeeperMasterContenderDetectorTest.MasterContender 8397
SlaveRecoveryTest/0.MultipleFrameworks 7971
ExamplesTest.PythonFramework 7326
HealthCheckTest.GracePeriod 6552
SlaveRecoveryTest/0.ReconcileKillTask 6150
ExamplesTest.LowLevelSchedulerPthread 6113
ExamplesTest.JavaFramework 5543
ExamplesTest.NoExecutorFramework 5391
ExamplesTest.TestFramework 5282
ExamplesTest.LowLevelSchedulerLibprocess 5282
ExamplesTest.JavaException 5177
ZooKeeperMasterContenderDetectorTest.ContenderPendingElection 5046
BasicMasterContenderDetectorTest.Detector 5010
BasicMasterContenderDetectorTest.Contender 5004
SlaveRecoveryTest/0.MultipleSlaves 4845
SlaveRecoveryTest/0.MasterFailover 4736
SlaveRecoveryTest/0.ShutdownSlave 4517
SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1 4482
SlaveRecoveryTest/0.Reboot 4481
SlaveRecoveryTest/0.KillTask 3600
SlaveRecoveryTest/0.SchedulerFailover 3542
SlaveRecoveryTest/0.ReconcileShutdownFramework 3534
GroupTest.GroupJoinWithDisconnect 3384
{noformat}
[jira] [Updated] (MESOS-1474) Provide cluster maintenance primitives for operators.
[ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1474: --- Description:
Sometimes operators need to perform maintenance on a mesos cluster; we define maintenance here as anything that requires the tasks to be drained on the slave(s). Most mesos upgrades can be done without affecting running tasks, but there are situations where maintenance is task-affecting:
* Host maintenance (e.g. hardware repair, kernel upgrades).
* Non-recoverable slave upgrades (e.g. adjusting slave attributes).
* etc
In order to ensure operators don’t violate frameworks’ SLAs, schedulers need to be aware of planned unavailability events. Maintenance awareness allows schedulers to avoid churn for long-running tasks by placing them on machines not undergoing maintenance. If all resources are planned for maintenance, then the scheduler will prefer machines scheduled for maintenance least imminently. Maintenance awareness is also crucial when a scheduler uses [persistent disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure that the scheduler is aware of the expected duration of unavailability for a persistent disk resource (e.g. with 3 1TB replicas, the scheduler doesn’t need to replicate 1TB over the network when only 1 of the 3 replicas is going to be unavailable for a reboot (< 1 hour)).
There are a few primitives of interest here:
* Provide a way for operators to [fully shut down a slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks underneath it). Colloquially known as a hard drain.
* Provide a way for operators to mark specific slaves as scheduled for maintenance. This will inform the scheduler about the scheduled unavailability of the resources.
* Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework a chance to proactively move a task before it may be forcibly killed by an operator. It also allows the automation of operations like: please drain these slaves within 1 hour.
See the [design doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#] for the latest details.
was:
Normally cluster upgrades can be done seamlessly using the built-in slave recovery feature. However, there are situations where operators want to be able to perform destructive maintenance operations on machines:
* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* Machine decommissioning.
* etc.
In these situations, best practice is to perform rolling maintenance in large batches of machines. This can be problematic for frameworks when many related tasks are located within a batch of machines going for maintenance.
There are a few primitives of interest here:
* Provide a way for operators to fully shut down a slave (killing all tasks underneath it).
* Provide a way for operators to mark specific slaves as undergoing maintenance. This means that no more offers are being sent for these slaves, and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to be relinquished. This gives the framework a chance to proactively move a task before it is forcibly killed. It also allows the automation of operations like: please drain these slaves within 1 hour.
Provide cluster maintenance primitives for operators. - Key: MESOS-1474 URL: https://issues.apache.org/jira/browse/MESOS-1474 Project: Mesos Issue Type: Epic Components: framework, master, slave Reporter: Benjamin Mahler
Sometimes operators need to perform maintenance on a mesos cluster; we define maintenance here as anything that requires the tasks to be drained on the slave(s). Most mesos upgrades can be done without affecting running tasks, but there are situations where maintenance is task-affecting:
* Host maintenance (e.g. hardware repair, kernel upgrades).
* Non-recoverable slave upgrades (e.g. adjusting slave attributes).
* etc
In order to ensure operators don’t violate frameworks’ SLAs, schedulers need to be aware of planned unavailability events. Maintenance awareness allows schedulers to avoid churn for long-running tasks by placing them on machines not undergoing maintenance. If all resources are planned for maintenance, then the scheduler will prefer machines scheduled for maintenance least imminently. Maintenance awareness is also crucial when a scheduler uses [persistent disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure that the scheduler is aware of the expected duration of unavailability for a persistent disk resource (e.g. with 3 1TB replicas, the scheduler doesn’t need to replicate 1TB over the network when only 1 of the 3 replicas is going to
[jira] [Updated] (MESOS-1470) Add operational documentation for running HA masters.
[ https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-1470: --- Description: Now that the master has replicated state on the disk, we should add documentation that guides operators for common maintenance work: * Swapping a master in the ensemble. * Growing the master ensemble. * Shrinking the master ensemble. This would help craft similar documentation for users of the replicated log. was: Now that the master has replicated state on the disk, we should add documentation that guides operators for common maintenance work: * Swapping a master in the ensemble. * Growing the master ensemble. * Shrinking the master ensemble. This would help craft similar documentation for users of the replicated log. We should also add documentation for existing slave maintenance documentation: * Best practices for rolling upgrades. * How to shut down a slave. This latter category will be incorporated with [~alexandra.sava]'s maintenance work! Affects Version/s: (was: 0.19.0) Summary: Add operational documentation for running HA masters. (was: Add cluster maintenance documentation.) Add operational documentation for running HA masters. - Key: MESOS-1470 URL: https://issues.apache.org/jira/browse/MESOS-1470 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Benjamin Mahler Labels: twitter Now that the master has replicated state on the disk, we should add documentation that guides operators for common maintenance work: * Swapping a master in the ensemble. * Growing the master ensemble. * Shrinking the master ensemble. This would help craft similar documentation for users of the replicated log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2060) Add support for 'hooks' in task launch sequence
Niklas Quarfot Nielsen created MESOS-2060: - Summary: Add support for 'hooks' in task launch sequence Key: MESOS-2060 URL: https://issues.apache.org/jira/browse/MESOS-2060 Project: Mesos Issue Type: Task Components: modules Reporter: Niklas Quarfot Nielsen Similar to Apache Modules, hooks allow module writers to tie into internal components that may not be suitable to be abstracted entirely behind modules, and instead let them define actions on so-called hooks (http://httpd.apache.org/docs/2.2/developer/hooks.html). In Apache Web Server, this lets people tie into the request processing cycle; in Mesos, one interesting place to start could be pre and post actions for the master and slave task launch sequence. Examples could be external statistics/metrics gathering, security infrastructure, etc.
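To make the idea of pre and post actions concrete, here is a minimal sketch of a hook registry. It is written in Python purely for illustration and does not reflect any existing Mesos module interface: `HookRegistry`, `launch_task`, and the hook names `pre_launch_task` and `post_launch_task` are all hypothetical.

```python
from collections import defaultdict


class HookRegistry:
    """Maps a hook name to the callbacks that modules registered for it."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, name, callback):
        self._hooks[name].append(callback)

    def run(self, name, *args):
        # Callbacks run in registration order; unknown hook names are a no-op.
        for callback in self._hooks.get(name, []):
            callback(*args)


def launch_task(registry, task, launch):
    """Wrap a launch function with pre and post hooks."""
    registry.run("pre_launch_task", task)
    result = launch(task)
    registry.run("post_launch_task", task, result)
    return result
```

A metrics module could then register a `post_launch_task` callback to count launches without the launch path knowing anything about metrics.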
[jira] [Created] (MESOS-2061) Add InverseOffer protobuf message.
Benjamin Mahler created MESOS-2061: -- Summary: Add InverseOffer protobuf message. Key: MESOS-2061 URL: https://issues.apache.org/jira/browse/MESOS-2061 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately.
[jira] [Created] (MESOS-2062) Add InverseOffer to Event/Call API.
Benjamin Mahler created MESOS-2062: -- Summary: Add InverseOffer to Event/Call API. Key: MESOS-2062 URL: https://issues.apache.org/jira/browse/MESOS-2062 Project: Mesos Issue Type: Task Components: c++ api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add this is to tack it on to the OFFERS Event:
{code}
message Offers {
  repeated Offer offers = 1;
  repeated InverseOffer inverse_offers = 2;
}
{code}
[jira] [Created] (MESOS-2064) Add InverseOffer to Java Scheduler API.
Benjamin Mahler created MESOS-2064: -- Summary: Add InverseOffer to Java Scheduler API. Key: MESOS-2064 URL: https://issues.apache.org/jira/browse/MESOS-2064 Project: Mesos Issue Type: Task Components: java api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Java Scheduler API is to add a new callback:
{code}
void inverseResourceOffers(
    SchedulerDriver driver,
    List<InverseOffer> inverseOffers);
{code}
JAR / libmesos compatibility will need to be figured out here.
[jira] [Created] (MESOS-2065) Add InverseOffer to Python Scheduler API.
Benjamin Mahler created MESOS-2065: -- Summary: Add InverseOffer to Python Scheduler API. Key: MESOS-2065 URL: https://issues.apache.org/jira/browse/MESOS-2065 Project: Mesos Issue Type: Task Components: python api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Python Scheduler API is to add a new callback: {code} def inverseResourceOffers(self, driver, inverse_offers): {code} Egg / libmesos compatibility will need to be figured out here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
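A scheduler could react to the proposed callback roughly as sketched below. Nothing here is the real binding: `Scheduler` and `RecordingDriver` are stand-in stubs so the example runs on its own, inverse offers are modeled as plain dictionaries with a `slave_id` key, and killing the affected tasks is just one possible policy. Giving the base class a no-op default for `inverseResourceOffers` is one way existing schedulers could keep working unchanged.

```python
class Scheduler(object):
    """Stand-in for the Mesos Python Scheduler base class (hypothetical stub)."""

    def resourceOffers(self, driver, offers):
        pass

    def inverseResourceOffers(self, driver, inverse_offers):
        # Proposed callback; the no-op default means schedulers that do not
        # override it simply ignore inverse offers.
        pass


class RecordingDriver(object):
    """Hypothetical driver stub that only records which tasks were killed."""

    def __init__(self):
        self.killed = []

    def killTask(self, task_id):
        self.killed.append(task_id)


class DrainingScheduler(Scheduler):
    """Gives up tasks on slaves named in inverse offers."""

    def __init__(self):
        self.tasks_by_slave = {}  # slave id -> list of task ids

    def inverseResourceOffers(self, driver, inverse_offers):
        for inverse_offer in inverse_offers:
            slave_id = inverse_offer["slave_id"]
            # A real scheduler would likely relaunch elsewhere before killing.
            for task_id in self.tasks_by_slave.pop(slave_id, []):
                driver.killTask(task_id)
```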
[jira] [Updated] (MESOS-2064) Add InverseOffer to Java Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2064: --- Description: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Java Scheduler API is to add a new callback:
{code}
void inverseResourceOffers(
    SchedulerDriver driver,
    List<InverseOffer> inverseOffers);
{code}
JAR / libmesos compatibility will need to be figured out here. We may want to leave the Java binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers.
was: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Java Scheduler API is to add a new callback:
{code}
void inverseResourceOffers(
    SchedulerDriver driver,
    List<InverseOffer> inverseOffers);
{code}
JAR / libmesos compatibility will need to be figured out here.
Add InverseOffer to Java Scheduler API. --- Key: MESOS-2064 URL: https://issues.apache.org/jira/browse/MESOS-2064 Project: Mesos Issue Type: Task Components: java api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Java Scheduler API is to add a new callback:
{code}
void inverseResourceOffers(
    SchedulerDriver driver,
    List<InverseOffer> inverseOffers);
{code}
JAR / libmesos compatibility will need to be figured out here. We may want to leave the Java binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers.
[jira] [Updated] (MESOS-2063) Add InverseOffer to C++ Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2063: --- Description: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the C++ Scheduler API is to add a new callback:
{code}
virtual void inverseResourceOffers(
    SchedulerDriver* driver,
    const std::vector<InverseOffer>& inverseOffers) = 0;
{code}
libmesos compatibility will need to be figured out here. We may want to leave the C++ binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers.
was: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the C++ Scheduler API is to add a new callback:
{code}
virtual void inverseResourceOffers(
    SchedulerDriver* driver,
    const std::vector<InverseOffer>& inverseOffers) = 0;
{code}
libmesos compatibility will need to be figured out here.
Add InverseOffer to C++ Scheduler API. -- Key: MESOS-2063 URL: https://issues.apache.org/jira/browse/MESOS-2063 Project: Mesos Issue Type: Task Components: c++ api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the C++ Scheduler API is to add a new callback:
{code}
virtual void inverseResourceOffers(
    SchedulerDriver* driver,
    const std::vector<InverseOffer>& inverseOffers) = 0;
{code}
libmesos compatibility will need to be figured out here. We may want to leave the C++ binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers.
[jira] [Updated] (MESOS-2065) Add InverseOffer to Python Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2065: --- Description: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Python Scheduler API is to add a new callback: {code} def inverseResourceOffers(self, driver, inverse_offers): {code} Egg / libmesos compatibility will need to be figured out here. We may want to leave the Python binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers. was: The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Python Scheduler API is to add a new callback: {code} def inverseResourceOffers(self, driver, inverse_offers): {code} Egg / libmesos compatibility will need to be figured out here. Add InverseOffer to Python Scheduler API. - Key: MESOS-2065 URL: https://issues.apache.org/jira/browse/MESOS-2065 Project: Mesos Issue Type: Task Components: python api Reporter: Benjamin Mahler The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Python Scheduler API is to add a new callback: {code} def inverseResourceOffers(self, driver, inverse_offers): {code} Egg / libmesos compatibility will need to be figured out here. We may want to leave the Python binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2061) Add InverseOffer protobuf message.
[ https://issues.apache.org/jira/browse/MESOS-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2061: --- Description: InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
{code}
// A request to deallocate or return any resources already
// being consumed by the framework.
message InverseOffer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  repeated Resource resources = 3;

  // The slave ID if the resources need to be released on a particular slave.
  optional SlaveID slave_id = 4;

  // The executor and task IDs if the resources need to be released on specific
  // executors and/or tasks.
  optional ExecutorID executor_id = 5;
  repeated TaskID task_ids = 6;

  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks running using these resources might get killed when
  // these resources become unavailable.
  required Unavailability unavailability = 7;
}
{code}
This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately.
was: InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately.
Add InverseOffer protobuf message. -- Key: MESOS-2061 URL: https://issues.apache.org/jira/browse/MESOS-2061 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler Labels: twitter InverseOffer was defined as part of the maintenance work in MESOS-1474, design doc here: https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
{code}
// A request to deallocate or return any resources already
// being consumed by the framework.
message InverseOffer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  repeated Resource resources = 3;

  // The slave ID if the resources need to be released on a particular slave.
  optional SlaveID slave_id = 4;

  // The executor and task IDs if the resources need to be released on specific
  // executors and/or tasks.
  optional ExecutorID executor_id = 5;
  repeated TaskID task_ids = 6;

  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks running using these resources might get killed when
  // these resources become unavailable.
  required Unavailability unavailability = 7;
}
{code}
This ticket is to capture the addition of the InverseOffer protobuf to mesos.proto; the necessary API changes for Event/Call and the language bindings will be tracked separately.
[jira] [Updated] (MESOS-2065) Add InverseOffer to Python Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-2065: --- Labels: twitter (was: ) Add InverseOffer to Python Scheduler API. - Key: MESOS-2065 URL: https://issues.apache.org/jira/browse/MESOS-2065 Project: Mesos Issue Type: Task Components: python api Reporter: Benjamin Mahler Labels: twitter The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474. One way to add these to the Python Scheduler API is to add a new callback: {code} def inverseResourceOffers(self, driver, inverse_offers): {code} Egg / libmesos compatibility will need to be figured out here. We may want to leave the Python binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.
Benjamin Mahler created MESOS-2066:
--
Summary: Add optional 'Unavailability' to resource offers to provide maintenance awareness.
Key: MESOS-2066
URL: https://issues.apache.org/jira/browse/MESOS-2066
Project: Mesos
Issue Type: Task
Reporter: Benjamin Mahler

In order to inform frameworks about upcoming maintenance on offered resources, per MESOS-1474, we'd like to add optional 'Unavailability' information to offers:

{code}
message Unavailability {
  required Time start = 1;

  // The approximate duration of the unavailability,
  // if this is a transient unavailability.
  optional Duration duration = 2;
}

message Offer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  required SlaveID slave_id = 3;
  required string hostname = 4;
  repeated Resource resources = 5;
  repeated Attribute attributes = 7;
  repeated ExecutorID executor_ids = 6;

  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks launched using these resources might get killed when
  // these resources become unavailable.
  optional Unavailability unavailability = 8;
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
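As an illustration of how a framework might use the proposed 'Unavailability' field, the sketch below decides whether a task with a known expected runtime would finish before the resources go away. This is an assumption-laden example, not part of the ticket: the Python Unavailability class, the use of float seconds in place of the Time and Duration messages, and the function names fits_before_unavailability and available_again_at are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unavailability:
    """Hypothetical mirror of the proposed message, using seconds as floats."""
    start: float                      # when the resources become unavailable
    duration: Optional[float] = None  # unset means no known end


def fits_before_unavailability(now, expected_runtime, unavailability):
    """Return True if a task started at `now` should finish before `start`.

    An offer without unavailability information imposes no constraint.
    """
    if unavailability is None:
        return True
    return now + expected_runtime <= unavailability.start


def available_again_at(unavailability):
    """Return the time the resources are expected back, or None if unknown."""
    if unavailability is None or unavailability.duration is None:
        return None
    return unavailability.start + unavailability.duration
```

A framework running short batch tasks could accept such an offer when the task fits, while a long-running service might decline it or prefer offers that carry no unavailability at all.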
[jira] [Updated] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.
[ https://issues.apache.org/jira/browse/MESOS-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2066:
---
Labels: twitter (was: )

Add optional 'Unavailability' to resource offers to provide maintenance awareness.
--
Key: MESOS-2066
URL: https://issues.apache.org/jira/browse/MESOS-2066
Project: Mesos
Issue Type: Task
Reporter: Benjamin Mahler
Labels: twitter

In order to inform frameworks about upcoming maintenance on offered resources, per MESOS-1474, we'd like to add optional 'Unavailability' information to offers:

{code}
message Unavailability {
  required Time start = 1;

  // The approximate duration of the unavailability,
  // if this is a transient unavailability.
  optional Duration duration = 2;
}

message Offer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  required SlaveID slave_id = 3;
  required string hostname = 4;
  repeated Resource resources = 5;
  repeated Attribute attributes = 7;
  repeated ExecutorID executor_ids = 6;

  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks launched using these resources might get killed when
  // these resources become unavailable.
  optional Unavailability unavailability = 8;
}
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2064) Add InverseOffer to Java Scheduler API.
[ https://issues.apache.org/jira/browse/MESOS-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-2064:
---
Labels: twitter (was: )

Add InverseOffer to Java Scheduler API.
---
Key: MESOS-2064
URL: https://issues.apache.org/jira/browse/MESOS-2064
Project: Mesos
Issue Type: Task
Components: java api
Reporter: Benjamin Mahler
Labels: twitter

The initial use case for InverseOffer in the framework API will be the maintenance primitives in mesos: MESOS-1474.

One way to add these to the Java Scheduler API is to add a new callback:

{code}
void inverseResourceOffers(
    SchedulerDriver driver,
    List<InverseOffer> inverseOffers);
{code}

JAR / libmesos compatibility will need to be figured out here. We may want to leave the Java binding untouched in favor of Event/Call, in order to not break API compatibility for schedulers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2067) Add HTTP API to the master for maintenance operations.
Benjamin Mahler created MESOS-2067:
--
Summary: Add HTTP API to the master for maintenance operations.
Key: MESOS-2067
URL: https://issues.apache.org/jira/browse/MESOS-2067
Project: Mesos
Issue Type: Task
Components: master
Reporter: Benjamin Mahler

Based on MESOS-1474, we'd like to provide an HTTP API on the master for the maintenance primitives in mesos.

Something like this for manipulating the schedule:
# GET /maintenance/schedule
# POST /maintenance/schedule (authenticated): many hosts
# POST /maintenance/schedule/hostname (authenticated): single host
# DELETE /maintenance/schedule/hostname (authenticated): single host

Something like this for checking the status / progress:
# GET /maintenance/status

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
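The schedule endpoints listed in MESOS-2067 could behave roughly like the in-memory model below. This is only a sketch to make the proposed semantics tangible: the MaintenanceSchedule class, its handle method, the dict-shaped request bodies and the specific status codes (401, 404, 405) are assumptions and are not taken from the ticket or from the Mesos master.

```python
class MaintenanceSchedule:
    """Hypothetical in-memory model of the proposed schedule endpoints."""

    PREFIX = "/maintenance/schedule"

    def __init__(self):
        self.windows = {}  # hostname -> opaque maintenance window

    def handle(self, method, path, body=None, authenticated=False):
        """Return an assumed (status_code, payload) pair for a request."""
        if path == self.PREFIX:
            hostname = None
        elif path.startswith(self.PREFIX + "/"):
            hostname = path[len(self.PREFIX) + 1:]
        else:
            return 404, None

        # GET /maintenance/schedule: read the whole schedule, no auth needed.
        if method == "GET" and not hostname:
            return 200, dict(self.windows)

        # The ticket marks every mutating endpoint as authenticated.
        if method in ("POST", "DELETE") and not authenticated:
            return 401, None

        # POST /maintenance/schedule: many hosts at once.
        if method == "POST" and not hostname:
            self.windows.update(body or {})
            return 200, None

        # POST /maintenance/schedule/hostname: a single host.
        if method == "POST":
            self.windows[hostname] = body
            return 200, None

        # DELETE /maintenance/schedule/hostname: a single host.
        if method == "DELETE" and hostname:
            removed = self.windows.pop(hostname, None)
            return (200 if removed is not None else 404), None

        return 405, None
```

For example, an authenticated POST to /maintenance/schedule/host1 followed by GET /maintenance/schedule would return a schedule containing host1, while the same POST without authentication would be rejected under these assumptions.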