date:20141110

Bernd Mathiske created MESOS-2056:
-

 Summary: Refactor fetcher code in preparation for fetcher cache
 Key: MESOS-2056
 URL: https://issues.apache.org/jira/browse/MESOS-2056
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske
Priority: Minor


Refactor/rearrange fetcher-related code so that cache functionality can be 
dropped in. One could do both together in one go. This is splitting up reviews 
into smaller chunks. It will not immediately be obvious how this change will be 
used later, but it will look better-factored and still do the exact same thing 
as before. In particular, a download routine to be reused several times in 
launcher/fetcher will be factored out and the remainder of fetcher-related code 
can be moved from the containerizer realm into fetcher.cpp.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher

Bernd Mathiske created MESOS-2057:
-

 Summary: Add cache functionality with concurrent downloading to 
fetcher
 Key: MESOS-2057
 URL: https://issues.apache.org/jira/browse/MESOS-2057
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske







[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher


 [ 
https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-2057:
--
Description: 
Add a URI flag to CommandInfo messages that indicates caching. When a URI is 
cached it is only ever downloaded once for the same user on the same slave as 
long as the slave keeps running. This even holds if multiple tasks request the 
same URI concurrently.

No cleanup or failover will be handled for now. Additional tickets will be 
filed for these enhancements. (So don't use this feature in production until 
the whole epic is complete.)

Depends on MESOS-2056.

 Add cache functionality with concurrent downloading to fetcher
 --

 Key: MESOS-2057
 URL: https://issues.apache.org/jira/browse/MESOS-2057
 Project: Mesos
  Issue Type: Improvement
  Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske

 Add a URI flag to CommandInfo messages that indicates caching. When a URI is 
 cached it is only ever downloaded once for the same user on the same slave 
 as long as the slave keeps running. This even holds if multiple tasks request 
 the same URI concurrently.
 No cleanup or failover will be handled for now. Additional tickets will be 
 filed for these enhancements. (So don't use this feature in production until 
 the whole epic is complete.)
 Depends on MESOS-2056.




[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher

[
https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Bernd Mathiske updated MESOS-2057:
--
Description:
Add a URI flag to CommandInfo messages that indicates caching. When a URI is
cached it is only ever downloaded once for the same user on the same slave as
long as the slave keeps running. This even holds if multiple tasks request the
same URI concurrently. Different URIs from different CommandInfos can be
downloaded concurrently.

No cleanup or failover will be handled for now. Additional tickets will be
filed for these enhancements. (So don't use this feature in production until
the whole epic is complete.)

Depends on MESOS-2056.

was:
Add a URI flag to CommandInfo messages that indicates caching. When a URI is
cached it is only ever downloaded once for the same user on the same slave as
long as the slave keeps running. This even holds if multiple tasks request the
same URI concurrently.

No cleanup or failover will be handled for now. Additional tickets will be
filed for these enhancements. (So don't use this feature in production until
the whole epic is complete.)

Depends on MESOS-2056.

Add cache functionality with concurrent downloading to fetcher
--

Key: MESOS-2057
URL: https://issues.apache.org/jira/browse/MESOS-2057
Project: Mesos
Issue Type: Improvement
Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske

Add a URI flag to CommandInfo messages that indicates caching. When a URI is
cached it is only ever downloaded once for the same user on the same slave
as long as the slave keeps running. This even holds if multiple tasks request
the same URI concurrently. Different URIs from different CommandInfos can be
downloaded concurrently.
No cleanup or failover will be handled for now. Additional tickets will be
filed for these enhancements. (So don't use this feature in production until
the whole epic is complete.)
Depends on MESOS-2056.


[jira] [Updated] (MESOS-2057) Add cache functionality with concurrent downloading to fetcher

[
https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

No cache eviction, cleanup or failover will be handled for now. Additional
tickets will be filed for these enhancements. (So don't use this feature in
production until the whole epic is complete.)

Depends on MESOS-2056.

No cleanup or failover will be handled for now. Additional tickets will be
filed for these enhancements. (So don't use this feature in production until
the whole epic is complete.)

Depends on MESOS-2056.

Add cache functionality with concurrent downloading to fetcher
--

Key: MESOS-2057
URL: https://issues.apache.org/jira/browse/MESOS-2057
Project: Mesos
Issue Type: Improvement
Components: fetcher, slave
Reporter: Bernd Mathiske
Assignee: Bernd Mathiske

Add a URI flag to CommandInfo messages that indicates caching. When a URI is
cached it is only ever downloaded once for the same user on the same slave
as long as the slave keeps running. This even holds if multiple tasks request
the same URI concurrently. Different URIs from different CommandInfos can be
downloaded concurrently.
No cache eviction, cleanup or failover will be handled for now. Additional
tickets will be filed for these enhancements. (So don't use this feature in
production until the whole epic is complete.)
Depends on MESOS-2056.
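
The "downloaded once, even under concurrent requests" behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the fetcher's design: FetchCache, fetch() and the injected download callback are hypothetical names, and like the ticket it leaves out eviction, cleanup and failover.

```cpp
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical sketch; none of these names come from the Mesos code base.
class FetchCache {
public:
  // Assumed callback: downloads `uri` and returns the local file path.
  using Download = std::function<std::string(const std::string& uri)>;

  explicit FetchCache(Download download) : download_(std::move(download)) {}

  // Returns a future for the local path of `uri`. A download is started
  // only if no entry exists yet for this (user, uri) pair.
  std::shared_future<std::string> fetch(const std::string& user,
                                        const std::string& uri) {
    std::lock_guard<std::mutex> lock(mutex_);
    const auto key = std::make_pair(user, uri);
    auto it = entries_.find(key);
    if (it != entries_.end()) {
      return it->second;  // Cached or still in flight: share the result.
    }
    // std::launch::async runs each download on its own thread, so
    // different URIs can be downloaded concurrently.
    std::shared_future<std::string> result =
        std::async(std::launch::async, download_, uri).share();
    entries_.emplace(key, result);
    return result;
  }

private:
  Download download_;
  std::mutex mutex_;
  std::map<std::pair<std::string, std::string>,
           std::shared_future<std::string>> entries_;
};
```

Storing the future rather than the finished path is what lets a second task that asks for the same URI mid-download wait on the first download instead of starting another one.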


[jira] [Commented] (MESOS-1919) Create IP address abstraction


[ 
https://issues.apache.org/jira/browse/MESOS-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205097#comment-14205097
 ] 

Dominic Hamon commented on MESOS-1919:
--

I think trying to bind on both seems reasonable. The abstraction for connecting 
should also try both, probably using [happy 
eyeballs|http://en.wikipedia.org/wiki/Happy_Eyeballs].

libprocess should almost certainly be extended to use a socket address instead 
of uint32_t _ip_ to store it.

 Create IP address abstraction
 -

 Key: MESOS-1919
 URL: https://issues.apache.org/jira/browse/MESOS-1919
 Project: Mesos
  Issue Type: Task
  Components: libprocess
Reporter: Dominic Hamon
Assignee: Evelina Dumitrescu
Priority: Minor

 In the code, many functions need only the IP address to be passed as a 
 parameter. I don't think it would be desirable to use a struct 
 SockaddrStorage (MESOS-1916).
 Consider using a {{std::vector<unsigned char>}} (see {{typedef 
 std::vector<unsigned char> IPAddressNumber;}} in the Chromium project).
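
A byte-vector address type along these lines might look like the sketch below. It assumes POSIX inet_pton() is available; parseIP() is a hypothetical helper name and not part of libprocess or Chromium.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <string>
#include <vector>

// Byte-vector representation, following the Chromium typedef quoted above.
typedef std::vector<unsigned char> IPAddressNumber;

// Hypothetical helper: parses a textual IPv4 or IPv6 address into its
// network-order bytes (4 or 16 of them). Returns false if `text` is neither.
bool parseIP(const std::string& text, IPAddressNumber* out) {
  unsigned char buffer[sizeof(struct in6_addr)];

  if (inet_pton(AF_INET, text.c_str(), buffer) == 1) {
    out->assign(buffer, buffer + sizeof(struct in_addr));
    return true;
  }

  if (inet_pton(AF_INET6, text.c_str(), buffer) == 1) {
    out->assign(buffer, buffer + sizeof(struct in6_addr));
    return true;
  }

  return false;
}
```

With this representation the address family is implied by the vector's size, so functions that only need the address can take one parameter for both IPv4 and IPv6.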




[jira] [Updated] (MESOS-1876) Remove deprecated 'slave_id' field in ReregisterSlaveMessage.


 [ 
https://issues.apache.org/jira/browse/MESOS-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1876:
-
Labels: newbie twitter  (was: newbie)

 Remove deprecated 'slave_id' field in ReregisterSlaveMessage.
 -

 Key: MESOS-1876
 URL: https://issues.apache.org/jira/browse/MESOS-1876
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Benjamin Mahler
Priority: Trivial
  Labels: newbie, twitter

 This is to follow through on removing the deprecated field that we've been 
 phasing out. In 0.21.0, this field will no longer be read:
 {code}
 message ReregisterSlaveMessage {
   // TODO(bmahler): slave_id is deprecated.
   // 0.21.0: Now an optional field. Always written, never read.
   // 0.22.0: Remove this field.
   optional SlaveID slave_id = 1;
   required SlaveInfo slave = 2;
   repeated ExecutorInfo executor_infos = 4;
   repeated Task tasks = 3;
   repeated Archive.Framework completed_frameworks = 5;
 }
 {code}




[jira] [Updated] (MESOS-1867) Precision errors in UI


 [ 
https://issues.apache.org/jira/browse/MESOS-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1867:
-
Labels: easyfix newbie  (was: easyfix)

 Precision errors in UI
 --

 Key: MESOS-1867
 URL: https://issues.apache.org/jira/browse/MESOS-1867
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 0.20.0
 Environment: mesos 0.20.0, 3 masters, 25 slaves
Reporter: Ian Babrou
Priority: Trivial
  Labels: easyfix, newbie

 Just look at the image: http://i.imgur.com/oFx1M7B.png
 I have ~2500 completed tasks from Chronos, 256 MB and 0.1 CPU each.
 At least the UI should be fixed.




[jira] [Updated] (MESOS-1753) Allow default/deleted functions


 [ 
https://issues.apache.org/jira/browse/MESOS-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1753:
-
Labels: c++11 twitter  (was: c++11)

 Allow default/deleted functions
 ---

 Key: MESOS-1753
 URL: https://issues.apache.org/jira/browse/MESOS-1753
 Project: Mesos
  Issue Type: Improvement
Reporter: Dominic Hamon
Priority: Minor
  Labels: c++11, twitter

 Add a check for defaulted/deleted functions to the configure script. Once 
 it is there, we can start using them across the code base.
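
For reference, the C++11 feature in question looks like this. Handle is a hypothetical class used only for illustration.

```cpp
#include <type_traits>

// Hypothetical class, for illustration only.
class Handle {
public:
  // Ask the compiler to generate the default constructor.
  Handle() = default;

  // Make copying a compile-time error instead of declaring the copy
  // operations private and leaving them undefined.
  Handle(const Handle&) = delete;
  Handle& operator=(const Handle&) = delete;
};

// Both properties can be checked at compile time.
static_assert(std::is_default_constructible<Handle>::value,
              "Handle should be default constructible");
static_assert(!std::is_copy_constructible<Handle>::value,
              "Handle should not be copyable");
```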




[jira] [Updated] (MESOS-1708) Using the wrong resource name should report a better error.


 [ 
https://issues.apache.org/jira/browse/MESOS-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1708:
-
Labels: newbie twitter  (was: newbie)

 Using the wrong resource name should report a better error.
 -

 Key: MESOS-1708
 URL: https://issues.apache.org/jira/browse/MESOS-1708
 Project: Mesos
  Issue Type: Bug
  Components: framework, master
Reporter: Benjamin Hindman
  Labels: newbie, twitter

 If a scheduler launches a task using resources the master doesn't know about, 
 the task validator causes the task to fail, but the error message is not very 
 helpful.




[jira] [Assigned] (MESOS-1622) Move implementations from .hpp to .cpp


 [ 
https://issues.apache.org/jira/browse/MESOS-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon reassigned MESOS-1622:


Assignee: Dominic Hamon

 Move implementations from .hpp to .cpp
 --

 Key: MESOS-1622
 URL: https://issues.apache.org/jira/browse/MESOS-1622
 Project: Mesos
  Issue Type: Story
  Components: technical debt
Reporter: Isabel Jimenez
Assignee: Dominic Hamon
Priority: Minor
  Labels: newbie

 This issue is related to MESOS-1582. Some headers have unnecessary inline 
 definitions and function declarations; to speed up build time, we are 
 lightening headers. This issue will not apply to stout.




[jira] [Updated] (MESOS-1517) Maintain a queue of messages that arrive before the master recovers.


 [ 
https://issues.apache.org/jira/browse/MESOS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1517:
-
Labels: reliability twitter  (was: reliability)

 Maintain a queue of messages that arrive before the master recovers.
 

 Key: MESOS-1517
 URL: https://issues.apache.org/jira/browse/MESOS-1517
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler
  Labels: reliability, twitter

 Currently when the master is recovering, we drop all incoming messages. If 
 slaves and frameworks knew about the leading master only once it has 
 recovered, then we would only expect to see messages after we've recovered.
 We previously considered enqueuing all messages through the recovery future, 
 but this has the downside of forcing all messages to go through the master's 
 queue twice:
 {code}
   // TODO(bmahler): Consider instead re-enqueing *all* messages
   // through recover(). What are the performance implications of
   // the additional queueing delay and the accumulated backlog
   // of messages post-recovery?
   if (!recovered.get().isReady()) {
     VLOG(1) << "Dropping '" << event.message->name << "' message since "
             << "not recovered yet";
     ++metrics.dropped_messages;
     return;
   }
 {code}
 However, an easy solution to this problem is to maintain an explicit queue of 
 incoming messages that gets flushed once we finish recovery. This ensures 
 that all messages post-recovery are processed normally.
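
A minimal sketch of the explicit queue proposed above, assuming messages can be represented as strings and handled through a callback. RecoveryBuffer, receive() and recovered() are hypothetical names; the real master handles libprocess events rather than strings.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Hypothetical sketch; not the master's actual message handling.
class RecoveryBuffer {
public:
  explicit RecoveryBuffer(std::function<void(const std::string&)> handler)
    : handler_(std::move(handler)), recovered_(false) {}

  // Called for every incoming message.
  void receive(const std::string& message) {
    if (!recovered_) {
      pending_.push(message);  // Hold the message instead of dropping it.
      return;
    }
    handler_(message);  // Post-recovery messages are processed directly.
  }

  // Called once when recovery finishes: flushes the held messages in
  // arrival order.
  void recovered() {
    recovered_ = true;
    while (!pending_.empty()) {
      handler_(pending_.front());
      pending_.pop();
    }
  }

private:
  std::function<void(const std::string&)> handler_;
  bool recovered_;
  std::queue<std::string> pending_;
};
```

Only messages that arrive during recovery pass through the extra queue, so the double-queueing cost mentioned in the TODO is not paid once recovery is complete. A real implementation would probably also want to bound the queue.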




[jira] [Updated] (MESOS-1424) Mesos tests should not rely on echo


 [ 
https://issues.apache.org/jira/browse/MESOS-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1424:
-
Labels: cleanup easyfix newbie osx tests  (was: cleanup easyfix osx tests)

 Mesos tests should not rely on echo
 ---

 Key: MESOS-1424
 URL: https://issues.apache.org/jira/browse/MESOS-1424
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Till Toenshoff
Priority: Minor
  Labels: cleanup, easyfix, newbie, osx, tests

 Triggered by MESOS-1413 I would like to propose changing our tests to not 
 rely on {{echo}} but to use {{printf}} instead.
 This seems to be useful as {{echo}} is introducing an extra linefeed after 
 the supplied string whereas {{printf}} does not. The {{-n}} switch preventing 
 that extra linefeed is unfortunately not portable - it is not supported by 
 the builtin {{echo}} of the BSD / OSX {{/bin/sh}}.




[jira] [Updated] (MESOS-1355) Use of untrusted string value in jvm.cpp


 [ 
https://issues.apache.org/jira/browse/MESOS-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1355:
-
Labels: coverity security  (was: coverity)

 Use of untrusted string value in jvm.cpp
 

 Key: MESOS-1355
 URL: https://issues.apache.org/jira/browse/MESOS-1355
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
  Labels: coverity, security

 
 *** CID 1213892:  Use of untrusted string value  (TAINTED_STRING)
 /src/jvm/jvm.cpp: 66 in Jvm::create(const std::vector<std::basic_string<char, 
 std::char_traits<char>, std::allocator<char>>, 
 std::allocator<std::basic_string<char, std::char_traits<char>, 
 std::allocator<char>>>>&, JNI::Version, bool)()
 60   std::string libJvmPath = os::getenv("JAVA_JVM_LIBRARY", false);
 61
 62   if (libJvmPath.empty()) {
 63     libJvmPath = mesos::internal::build::JAVA_JVM_LIBRARY;
 64   }
 65
 >>> CID 1213892:  Use of untrusted string value  (TAINTED_STRING)
 >>> Passing tainted string "libJvmPath.c_str()" to "dlopen(char const *, 
 >>> int)", which cannot accept tainted data.
 66   void* handle = dlopen(libJvmPath.c_str(), RTLD_NOW);
 67
 68   if (handle == NULL) {
 69     return Error(dlerror());
 70   }
 71




[jira] [Updated] (MESOS-1353) Operands don't affect result in mesos_containerizer.cpp


 [ 
https://issues.apache.org/jira/browse/MESOS-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1353:
-
Labels: coverity newbie  (was: coverity)

 Operands don't affect result in mesos_containerizer.cpp
 ---

 Key: MESOS-1353
 URL: https://issues.apache.org/jira/browse/MESOS-1353
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
Priority: Minor
  Labels: coverity, newbie

 May be a false positive - should be investigated.
 
 *** CID 1213887:  Operands don't affect result  (CONSTANT_EXPRESSION_RESULT)
 /src/slave/containerizer/mesos_containerizer.cpp: 416 in 
 mesos::internal::slave::execute(const mesos::CommandInfo &, const 
 std::basic_string<char, std::char_traits<char>, std::allocator<char>>&, const 
 Option<std::basic_string<char, std::char_traits<char>, 
 std::allocator<char>>>&, const std::map<std::basic_string<char, 
 std::char_traits<char>, std::allocator<char>>, std::basic_string<char, 
 std::char_traits<char>, std::allocator<char>>, 
 std::less<std::basic_string<char, std::char_traits<char>, 
 std::allocator<char>>>, std::allocator<std::pair<const 
 std::basic_string<char, std::char_traits<char>, std::allocator<char>>, 
 std::basic_string<char, std::char_traits<char>, std::allocator<char>>>>>&, 
 bool, int, int, const std::list<Option<mesos::CommandInfo>, 
 std::allocator<Option<mesos::CommandInfo>>>&)()
 410     if (chown.isError()) {
 411       ABORT("Failed to chown work directory");
 412     }
 413   }
 414
 415   // Enter working directory.
 >>> CID 1213887:  Operands don't affect result  (CONSTANT_EXPRESSION_RESULT)
 >>> "chdir(directory) < 0" is always false regardless of the values of 
 >>> its operands. This occurs as the logical operand of "if".
 416   if (os::chdir(directory) < 0) {
 417     ABORT("Failed to chdir into work directory");
 418   }
 419
 420   // Redirect output to files in working dir if required. We append because
 421   // others (e.g., mesos-fetcher) may have already logged to the files.
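
One possible reading of this report, assuming the chdir wrapper returns a bool rather than the raw int from chdir(2): a bool converts to 0 or 1, so comparing it with "< 0" can never be true and a failed chdir would go unnoticed. The sketch below uses a hypothetical stand-in, changeDirectory(), not the stout function itself.

```cpp
#include <unistd.h>

#include <string>

// Hypothetical stand-in for a chdir wrapper that reports success as a
// bool (an assumption about the wrapper flagged in the report above).
bool changeDirectory(const std::string& directory) {
  return ::chdir(directory.c_str()) == 0;
}

// Under that assumption the check should test the bool directly.
// Returns true if the working directory was changed.
bool enterWorkDirectory(const std::string& directory) {
  if (!changeDirectory(directory)) {
    return false;  // The caller would abort here.
  }
  return true;
}
```

If the wrapper does return bool, the finding is not a false positive.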




[jira] [Updated] (MESOS-1194) protobuf-JSON rendering doesnt validate


 [ 
https://issues.apache.org/jira/browse/MESOS-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1194:
-
Labels: json newbie protobuf stout  (was: json protobuf stout)

 protobuf-JSON rendering doesnt validate
 ---

 Key: MESOS-1194
 URL: https://issues.apache.org/jira/browse/MESOS-1194
 Project: Mesos
  Issue Type: Bug
  Components: stout
Affects Versions: 0.19.0
Reporter: Till Toenshoff
Priority: Minor
  Labels: json, newbie, protobuf, stout

 When using JSON::Protobuf(Message), the supplied protobuf is not checked for 
 being properly initialized, hence e.g. required fields could be missing.
 It would be desirable to have a feedback mechanism in place for this 
 constructor - maybe this would do:
 {noformat}
 if (!message.IsInitialized()) {
   std::cerr << "Protobuf not initialized: "
             << message.InitializationErrorString() << std::endl;
   abort();
 }
 {noformat}




[jira] [Closed] (MESOS-932) building spark0.8.1 with mesos 0.15.0 error becauseof protobuf is not compatible


 [ 
https://issues.apache.org/jira/browse/MESOS-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon closed MESOS-932.
---
Resolution: Fixed

 building spark0.8.1 with mesos 0.15.0 error becauseof protobuf is not 
 compatible
 

 Key: MESOS-932
 URL: https://issues.apache.org/jira/browse/MESOS-932
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 0.15.0
 Environment: hadoop2.2.0+spark0.8.0+mesos0.15.0
Reporter: liuhanbing
  Labels: test
   Original Estimate: 96h
  Remaining Estimate: 96h

 I've tried building spark0.8.1 with mesos 0.15.0 and I have the exact error:
 Stack: [0x7f4cd1734000,0x7f4cd1835000],  sp=0x7f4cd1833490,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
 V  [libjvm.so+0x514706]  unsigned+0xb6
 j  org.apache.mesos.MesosSchedulerDriver.initialize()V+0
 j  org.apache.mesos.MesosSchedulerDriver.<init>(Lorg/apache/mesos/Scheduler;Lorg/apache/mesos/Protos$FrameworkInfo;Ljava/lang/String;)V+62
 j  org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend$$anon$1.run()V+44
 v  ~StubRoutines::call_stub
 Java Threads: ( => current thread )
 =>0x7f4cd4186800 JavaThread "MesosSchedulerBackend driver" daemon [_thread_in_vm, id=19685, stack(0x7f4cd1734000,0x7f4cd1835000)]
   0x7f4cd4170800 JavaThread "Timer-0" daemon [_thread_blocked, id=19684, stack(0x7f4cd1835000,0x7f4cd1936000)]
   0x7f4cd4077000 JavaThread "qtp1645300503-49" daemon [_thread_blocked, id=19683, stack(0x7f4cd33f4000,0x7f4cd34f5000)]
   0x7f4cd4075000 JavaThread "qtp1645300503-48" daemon [_thread_blocked, id=19682, stack(0x7f4cd34f5000,0x7f4cd35f6000)]
 It looks like the protobuf versions are not compatible: spark0.8.1 has 
 protobuf2.5.0 but mesos0.15.0 still has protobuf2.4.1.
 I've rebuilt mesos 0.15.0 using protobuf 2.5.0. The error I am getting still 
 seems to be related to protobuf, and I do not know how to debug it.
 Any ideas?




[jira] [Commented] (MESOS-719) missing-call-to-setgroups


[ 
https://issues.apache.org/jira/browse/MESOS-719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205217#comment-14205217
 ] 

Dominic Hamon commented on MESOS-719:
-

[~Ken Simon] [~tstclair] Can we get this patch uploaded to RB please?

 missing-call-to-setgroups
 -

 Key: MESOS-719
 URL: https://issues.apache.org/jira/browse/MESOS-719
 Project: Mesos
  Issue Type: Bug
  Components: general
Affects Versions: 0.15.0
Reporter: Timothy St. Clair
  Labels: newbie
 Attachments: MESOS-719-0.20.1.patch


 This traces into stout/os.hpp.
 While vetting the code as part of Fedora packaging, rpmlint reports an error 
 about privilege changing:
 mesos.x86_64: E: missing-call-to-setgroups /usr/lib64/libmesos-0.15.0.so.0.0.0
 https://www.securecoding.cert.org/confluence/display/seccode/POS36-C.+Observe+correct+revocation+order+while+relinquishing+privileges




[jira] [Updated] (MESOS-698) Implement requestResources() feature in the master


 [ 
https://issues.apache.org/jira/browse/MESOS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-698:

Assignee: Vinod Kone

 Implement requestResources() feature in the master
 --

 Key: MESOS-698
 URL: https://issues.apache.org/jira/browse/MESOS-698
 Project: Mesos
  Issue Type: Improvement
  Components: framework, master, slave
Affects Versions: 0.12.0
Reporter: Sam Taha
Assignee: Vinod Kone
  Labels: features

 Scheduler.requestResources() does not remove filters on slaves.
 Looking at the code it doesn't look like the requestResources() feature is 
 implemented in the master. It is currently a no-op on the master/allocator. 
 The Request protobuf does have an optional slaveId which can be used to 
 perhaps remove any filters on the slave to allow offers to flow.




[jira] [Updated] (MESOS-713) Support for adding subsystems to existing cgroup hierarchies.


 [ 
https://issues.apache.org/jira/browse/MESOS-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-713:

Labels: newbie twitter  (was: newbie)

 Support for adding subsystems to existing cgroup hierarchies.
 -

 Key: MESOS-713
 URL: https://issues.apache.org/jira/browse/MESOS-713
 Project: Mesos
  Issue Type: Improvement
  Components: isolation
Reporter: Benjamin Mahler
Priority: Minor
  Labels: newbie, twitter

 Currently if a slave is restarted with additional subsystems, it will refuse 
 to proceed if those subsystems are not attached to the existing hierarchy.
 It's possible to add subsystems to existing hierarchies via re-mounting:
 https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-Attaching_Subsystems_to_and_Detaching_Them_From_an_Existing_Hierarchy.html
 We can add support for this by calling mount with the MS_REMOUNT option.




[jira] [Updated] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.


 [ 
https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-391:

Labels: newbie twitter  (was: newbie)

 Slave GarbageCollector needs to also take into account the number of links, 
 when determining removal time.
 --

 Key: MESOS-391
 URL: https://issues.apache.org/jira/browse/MESOS-391
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Mahler
  Labels: newbie, twitter

 The slave garbage collector does not take into account the number of links 
 present, which means that if we create a lot of executor directories (up to 
 LINK_MAX), we won't necessarily GC.
 As a result of this, the slave crashes:
 F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed 
 to create executor directory 
 '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e':
  Too many links
 *** Check failure stack trace: ***
 @ 0x7f9320f82f9d  google::LogMessage::Fail()
 @ 0x7f9320f88c07  google::LogMessage::SendToLog()
 @ 0x7f9320f8484c  google::LogMessage::Flush()
 @ 0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
 @ 0x7f9320c70312  _CheckSome::~_CheckSome()
 @ 0x7f9320c9dd5c  
 mesos::internal::slave::paths::createExecutorDirectory()
 @ 0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
 @ 0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
 @ 0x7f9320c9cb43  ProtobufProcess::handler4()
 @ 0x7f9320c8678b  std::tr1::_Function_handler::_M_invoke()
 @ 0x7f9320c9d1ab  ProtobufProcess::visit()
 @ 0x7f9320e4c774  process::MessageEvent::visit()
 @ 0x7f9320e40a1d  process::ProcessManager::resume()
 @ 0x7f9320e41268  process::schedule()
 @ 0x7f932055973d  start_thread
 @ 0x7f931ef3df6d  clone
 The fix here is to take into account the number of links (st_nlink) when 
 determining whether we need to GC.
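
A rough sketch of such a check, assuming POSIX stat(2). nearLinkLimit() and its parameters are hypothetical and not part of the slave's garbage collector; the caller is assumed to obtain the limit from pathconf(directory, _PC_LINK_MAX) or a configured value.

```cpp
#include <sys/stat.h>

#include <string>

// Hypothetical helper. Returns true when the link count of `directory`
// has reached `fraction` (e.g. 0.9) of `linkMax`, which a garbage
// collector could treat as a signal to remove old directories early.
bool nearLinkLimit(const std::string& directory,
                   long linkMax,
                   double fraction) {
  struct stat s;
  if (::stat(directory.c_str(), &s) != 0) {
    return false;  // Cannot stat: report no pressure, let the caller decide.
  }

  // On many file systems each subdirectory adds one link to its parent,
  // so st_nlink grows with the number of executor directories.
  return static_cast<double>(s.st_nlink) >=
         fraction * static_cast<double>(linkMax);
}
```

The fraction leaves headroom so that collection can start before mkdir fails with "Too many links".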




[jira] [Updated] (MESOS-723) Expose total number of resources allocated to the slave in its endpoint


 [ 
https://issues.apache.org/jira/browse/MESOS-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes updated MESOS-723:
-
Fix Version/s: (was: 0.22.0)
   0.21.0

 Expose total number of resources allocated to the slave in its endpoint
 ---

 Key: MESOS-723
 URL: https://issues.apache.org/jira/browse/MESOS-723
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone
  Labels: twitter
 Fix For: 0.21.0


 This could be useful information if there are bugs in the master/slave that 
 cause slaves to overcommit their resources.




[jira] [Updated] (MESOS-2052) RunState::recover should always recover 'completed'


 [ 
https://issues.apache.org/jira/browse/MESOS-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes updated MESOS-2052:
--
Fix Version/s: 0.21.0

 RunState::recover should always recover 'completed'
 ---

 Key: MESOS-2052
 URL: https://issues.apache.org/jira/browse/MESOS-2052
 Project: Mesos
  Issue Type: Bug
  Components: containerization, slave
Affects Versions: 0.20.0
Reporter: Ian Downes
Assignee: Ian Downes
 Fix For: 0.21.0


 RunState::recover() will return partial state if it cannot find or open the 
 libprocess pid file. Specifically, it does not recover the 'completed' flag.
 However, if the slave has removed the executor (because launch failed or the 
 executor failed to register) the sentinel flag will be set and this fact 
 should be recovered. This ensures that container recovery is not attempted 
 later.
 This was discovered when the LinuxLauncher failed to recover because it was 
 asked to recover two containers with the same forkedPid. Investigation showed 
 the executors both OOM'ed before registering, i.e., no libprocess pid file 
 was present. However, the containerizer had detected the OOM, destroyed the 
 container, and notified the slave which cleaned everything up: failing the 
 task and calling removeExecutor (which writes the completed sentinel file.)




[jira] [Updated] (MESOS-2046) Configure should check headers and libraries for svn and apr

2014-11-10 Thread Timothy Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen updated MESOS-2046:

Summary: Configure should check headers and libraries for svn and apr  
(was: Configure should check headers instead of libraries for svn and apr)

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen

 For dependencies whose headers we include, we need to check for those 
 headers, which means that configure needs to call AC_CHECK_HEADER instead of 
 AC_CHECK_LIB.




[jira] [Commented] (MESOS-2046) Configure should check headers and libraries for svn and apr

2014-11-10 Thread Timothy Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205254#comment-14205254
 ] 

Timothy Chen commented on MESOS-2046:
-

commit 42458bf2ad7fde97b18b167ba79b67effe43b0e9
Author: Timothy Chen tnac...@gmail.com
Date:   Thu Nov 6 14:01:33 2014 -0800

Added check for apr and svn headers besides libraries

Review: https://reviews.apache.org/r/27701

commit f8c89e8eda098be9c9a592e53edadc457796d3de
Author: Timothy Chen tnac...@gmail.com
Date:   Thu Nov 6 14:01:43 2014 -0800

Added check for apr and svn headers besides libraries in 3rdparty

Review: https://reviews.apache.org/r/27704

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen

 For dependencies whose headers we include, we need to check for those 
 headers, which means that configure needs to call AC_CHECK_HEADER instead of 
 AC_CHECK_LIB.




[jira] [Closed] (MESOS-2046) Configure should check headers and libraries for svn and apr

2014-11-10 Thread Timothy Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen closed MESOS-2046.
---

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen

 For dependencies whose headers we include, we need to check for those 
 headers, which means that configure needs to call AC_CHECK_HEADER instead of 
 AC_CHECK_LIB.




[jira] [Reopened] (MESOS-2046) Configure should check headers and libraries for svn and apr


 [ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes reopened MESOS-2046:
---

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.21.0


 For dependencies whose headers we include, we need to check for those headers 
 as well as the libraries, so configure needs to call AC_CHECK_HEADER in 
 addition to AC_CHECK_LIB.




[jira] [Updated] (MESOS-2046) Configure should check headers and libraries for svn and apr


 [ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes updated MESOS-2046:
--
Fix Version/s: 0.21.0

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.21.0


 For dependencies whose headers we include, we need to check for those headers 
 as well as the libraries, so configure needs to call AC_CHECK_HEADER in 
 addition to AC_CHECK_LIB.




[jira] [Resolved] (MESOS-2046) Configure should check headers and libraries for svn and apr


 [ 
https://issues.apache.org/jira/browse/MESOS-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Downes resolved MESOS-2046.
---
Resolution: Fixed

 Configure should check headers and libraries for svn and apr
 

 Key: MESOS-2046
 URL: https://issues.apache.org/jira/browse/MESOS-2046
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.21.0


 For dependencies whose headers we include, we need to check for those headers 
 as well as the libraries, so configure needs to call AC_CHECK_HEADER in 
 addition to AC_CHECK_LIB.




[jira] [Commented] (MESOS-1694) Future::failure should return a const string


[ 
https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205289#comment-14205289
 ] 

Dominic Hamon commented on MESOS-1694:
--

https://reviews.apache.org/r/27826

 Future::failure should return a const string
 -

 Key: MESOS-1694
 URL: https://issues.apache.org/jira/browse/MESOS-1694
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Dominic Hamon
Assignee: Dominic Hamon
Priority: Minor
  Labels: newbie






[jira] [Created] (MESOS-2058) Deprecate stats.json endpoints for Master and Slave

Dominic Hamon created MESOS-2058:


 Summary: Deprecate stats.json endpoints for Master and Slave
 Key: MESOS-2058
 URL: https://issues.apache.org/jira/browse/MESOS-2058
 Project: Mesos
  Issue Type: Task
  Components: master, slave
Reporter: Dominic Hamon


With the introduction of the libprocess {{/metrics/snapshot}} endpoint, metrics 
are now duplicated in the Master and Slave between this and {{stats.json}}. We 
should deprecate the {{stats.json}} endpoints.

Manual inspection of {{stats.json}} shows that all metrics are now covered by 
the new endpoint for Master and Slave.




[jira] [Updated] (MESOS-2039) Create a Design Doc for Dynamic Reservations

2014-11-10 Thread Michael Park (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-2039:

Description: 
A design doc to be shared with the community for the Dynamic Reservation epic.

Update: Shared [dynamic 
reservation|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing]
 with the dev mailing list.

  was:
A design doc to be shared with the community for the Dynamic Reservation epic.

Update: Shared [Design 
Doc|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing]
 with the dev mailing list.


 Create a Design Doc for Dynamic Reservations
 

 Key: MESOS-2039
 URL: https://issues.apache.org/jira/browse/MESOS-2039
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Michael Park
Assignee: Michael Park

 A design doc to be shared with the community for the Dynamic Reservation epic.
 Update: Shared [dynamic 
 reservation|https://docs.google.com/document/d/1e3j69pfBgtc8xM00DhcuiMl6ImkEB5na0TzOMyzrg8A/edit?usp=sharing]
  with the dev mailing list.




[jira] [Created] (MESOS-2059) improve performance of expensive tests

Dominic Hamon created MESOS-2059:


 Summary: improve performance of expensive tests
 Key: MESOS-2059
 URL: https://issues.apache.org/jira/browse/MESOS-2059
 Project: Mesos
  Issue Type: Improvement
  Components: technical debt, test
Reporter: Dominic Hamon
Priority: Minor


Many of our tests take a long time to run, which slows down the developer 
compile-test cycle. Improving the performance of the worst cases can lead to a 
significant improvement in developer workflow.

A quick test shows that focusing on a few key test fixtures might be worthwhile:

{noformat}
$ egrep '\(.* ms\)$' test.log | cut -d\  -f10- | cut -d\  -f1-2 | sed 's/(//' | sort -k2 -rn | head -n 30
ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors 15107
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireMasterZKSession 13473
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster 13434
ZooKeeperMasterContenderDetectorTest.MasterContenders 10089
ZooKeeperMasterContenderDetectorTest.MasterDetectorTimedoutSession 10081
ZooKeeperMasterContenderDetectorTest.ContenderDetectorShutdownNetwork 8459
ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSession 8424
ZooKeeperMasterContenderDetectorTest.MasterContender 8397
SlaveRecoveryTest/0.MultipleFrameworks 7971
ExamplesTest.PythonFramework 7326
HealthCheckTest.GracePeriod 6552
SlaveRecoveryTest/0.ReconcileKillTask 6150
ExamplesTest.LowLevelSchedulerPthread 6113
ExamplesTest.JavaFramework 5543
ExamplesTest.NoExecutorFramework 5391
ExamplesTest.TestFramework 5282
ExamplesTest.LowLevelSchedulerLibprocess 5282
ExamplesTest.JavaException 5177
ZooKeeperMasterContenderDetectorTest.ContenderPendingElection 5046
BasicMasterContenderDetectorTest.Detector 5010
BasicMasterContenderDetectorTest.Contender 5004
SlaveRecoveryTest/0.MultipleSlaves 4845
SlaveRecoveryTest/0.MasterFailover 4736
SlaveRecoveryTest/0.ShutdownSlave 4517
SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1 4482
SlaveRecoveryTest/0.Reboot 4481
SlaveRecoveryTest/0.KillTask 3600
SlaveRecoveryTest/0.SchedulerFailover 3542
SlaveRecoveryTest/0.ReconcileShutdownFramework 3534
GroupTest.GroupJoinWithDisconnect 3384
{noformat}
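
The same ranking could also be sketched in Python, which avoids depending on exact column positions. This is only a rough sketch, assuming gtest-style result lines that end in "(N ms)"; the function name `slowest_tests` and the sample log lines are illustrative and not part of Mesos.

```python
import re

# Matches gtest result lines such as "[       OK ] Suite.Name (123 ms)".
# Assumption: the test name is the single whitespace-free token before "(N ms)".
RESULT_LINE = re.compile(r"^\[\s*(?:OK|FAILED)\s*\]\s+(\S+)\s+\((\d+) ms\)$")


def slowest_tests(lines, limit=30):
    """Return (test name, milliseconds) pairs, slowest first."""
    timings = []
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match:
            timings.append((match.group(1), int(match.group(2))))
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:limit]


sample_log = [
    "[ RUN      ] HealthCheckTest.GracePeriod",
    "[       OK ] HealthCheckTest.GracePeriod (6552 ms)",
    "[       OK ] GroupTest.GroupJoinWithDisconnect (3384 ms)",
    "[       OK ] ExamplesTest.PythonFramework (7326 ms)",
]

for name, millis in slowest_tests(sample_log, limit=2):
    print(name, millis)
```

Grouping the result by fixture name (the part before the dot) would be a natural next step for finding which fixtures to focus on.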




[jira] [Commented] (MESOS-1694) Future::failure should return a const string


[ 
https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205461#comment-14205461
 ] 

Dominic Hamon commented on MESOS-1694:
--

https://reviews.apache.org/r/27831/ introduces ABORT if string is not set which 
should simplify this patch.

 Future::failure should return a const string
 -

 Key: MESOS-1694
 URL: https://issues.apache.org/jira/browse/MESOS-1694
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Dominic Hamon
Assignee: Dominic Hamon
Priority: Minor
  Labels: newbie






[jira] [Updated] (MESOS-1694) Future::failure should return a const string


 [ 
https://issues.apache.org/jira/browse/MESOS-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominic Hamon updated MESOS-1694:
-
Sprint: Twitter Mesos Q4 Sprint 3

 Future::failure should return a const string
 -

 Key: MESOS-1694
 URL: https://issues.apache.org/jira/browse/MESOS-1694
 Project: Mesos
  Issue Type: Task
  Components: technical debt
Reporter: Dominic Hamon
Assignee: Dominic Hamon
Priority: Minor
  Labels: newbie






[jira] [Commented] (MESOS-2059) improve performance of expensive tests


[ 
https://issues.apache.org/jira/browse/MESOS-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205489#comment-14205489
 ] 

Benjamin Mahler commented on MESOS-2059:


Linking in MESOS-1757 (an epic for speeding up tests), where I included a few 
starting suggestions and was hoping to have them broken off under the epic.

 improve performance of expensive tests
 --

 Key: MESOS-2059
 URL: https://issues.apache.org/jira/browse/MESOS-2059
 Project: Mesos
  Issue Type: Improvement
  Components: technical debt, test
Reporter: Dominic Hamon
Priority: Minor

 Many of our tests take a long time to run which has an impact on the 
 developer compile-test cycle. Improving the performance of the worst cases 
 can lead to a significant improvement in developer workflow.
 A quick test shows that focusing on a few key test fixtures might be 
 worthwhile:
 {noformat}
 $ egrep '\(.* ms\)$' test.log | cut -d\  -f10- | cut -d\  -f1-2 | sed 's/(//' | sort -k2 -rn | head -n 30
 ZooKeeperMasterContenderDetectorTest.NonRetryableFrrors 15107
 ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireMasterZKSession 13473
 ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSessionNewMaster 13434
 ZooKeeperMasterContenderDetectorTest.MasterContenders 10089
 ZooKeeperMasterContenderDetectorTest.MasterDetectorTimedoutSession 10081
 ZooKeeperMasterContenderDetectorTest.ContenderDetectorShutdownNetwork 8459
 ZooKeeperMasterContenderDetectorTest.MasterDetectorExpireSlaveZKSession 8424
 ZooKeeperMasterContenderDetectorTest.MasterContender 8397
 SlaveRecoveryTest/0.MultipleFrameworks 7971
 ExamplesTest.PythonFramework 7326
 HealthCheckTest.GracePeriod 6552
 SlaveRecoveryTest/0.ReconcileKillTask 6150
 ExamplesTest.LowLevelSchedulerPthread 6113
 ExamplesTest.JavaFramework 5543
 ExamplesTest.NoExecutorFramework 5391
 ExamplesTest.TestFramework 5282
 ExamplesTest.LowLevelSchedulerLibprocess 5282
 ExamplesTest.JavaException 5177
 ZooKeeperMasterContenderDetectorTest.ContenderPendingElection 5046
 BasicMasterContenderDetectorTest.Detector 5010
 BasicMasterContenderDetectorTest.Contender 5004
 SlaveRecoveryTest/0.MultipleSlaves 4845
 SlaveRecoveryTest/0.MasterFailover 4736
 SlaveRecoveryTest/0.ShutdownSlave 4517
 SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1 4482
 SlaveRecoveryTest/0.Reboot 4481
 SlaveRecoveryTest/0.KillTask 3600
 SlaveRecoveryTest/0.SchedulerFailover 3542
 SlaveRecoveryTest/0.ReconcileShutdownFramework 3534
 GroupTest.GroupJoinWithDisconnect 3384
 {noformat}




[jira] [Updated] (MESOS-1474) Provide cluster maintenance primitives for operators.

[
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Benjamin Mahler updated MESOS-1474:
---
Description:
Sometimes operators need to perform maintenance on a mesos cluster; we define
maintenance here as anything that requires the tasks to be drained on the
slave(s). Most mesos upgrades can be done without affecting running tasks, but
there are situations where maintenance is task-affecting:
* Host maintenance (e.g. hardware repair, kernel upgrades).
* Non-recoverable slave upgrades (e.g. adjusting slave attributes).
* etc

In order to ensure operators don’t violate frameworks’ SLAs, schedulers need to
be aware of planned unavailability events.

Maintenance awareness allows schedulers to avoid churn for long running tasks
by placing them on machines not undergoing maintenance. If all resources are
planned for maintenance, then the scheduler will prefer machines scheduled for
maintenance least imminently.
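
As a rough illustration of that preference, a scheduler could rank candidate machines by how far away their maintenance window is. This is a minimal sketch and not Mesos code: the `Machine` record, its `maintenance_start` field (epoch seconds, or `None` when no maintenance is planned), and `pick_machine` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    hostname: str
    # Hypothetical field: planned maintenance start in epoch seconds,
    # or None if no maintenance is scheduled for this machine.
    maintenance_start: Optional[float] = None


def pick_machine(machines: List[Machine]) -> Machine:
    """Prefer machines without planned maintenance, then the latest start."""
    if not machines:
        raise ValueError("no machines to choose from")

    def sort_key(machine: Machine):
        if machine.maintenance_start is None:
            return (1, 0.0)  # unscheduled machines rank above all others
        return (0, machine.maintenance_start)

    return max(machines, key=sort_key)


candidates = [
    Machine("a.example.com", maintenance_start=1_000.0),
    Machine("b.example.com", maintenance_start=5_000.0),
]
print(pick_machine(candidates).hostname)  # b.example.com: its window opens later
print(pick_machine(candidates + [Machine("c.example.com")]).hostname)  # c.example.com
```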

Maintenance awareness is also crucial when a scheduler uses [persistent
disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure
that the scheduler is aware of the expected duration of unavailability for a
persistent disk resource (e.g. with 3 1TB replicas, there is no need to
replicate 1TB over the network when only 1 of the 3 replicas is going to be
unavailable for a reboot (< 1 hour)).

There are a few primitives of interest here:

* Provide a way for operators to [fully shut down a
slave|https://issues.apache.org/jira/browse/MESOS-1475] (killing all tasks
underneath it). This is colloquially known as a hard drain.
* Provide a way for operators to mark specific slaves as scheduled for
maintenance. This will inform the scheduler about the scheduled unavailability
of the resources.
* Provide a way for frameworks to be notified when resources are requested to
be relinquished. This gives the framework a chance to proactively move a task
before it may be forcibly killed by an operator. It also allows the automation
of operations like: please drain these slaves within 1 hour.

See the [design
doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
for the latest details.

was:
Normally cluster upgrades can be done seamlessly using the built-in slave
recovery feature. However, there are situations where operators want to be able
to perform destructive maintenance operations on machines:

* Non-recoverable slave upgrades.
* Machine reboots.
* Kernel upgrades.
* Machine decommissioning.
* etc.

In these situations, best practice is to perform rolling maintenance in large
batches of machines. This can be problematic for frameworks when many related
tasks are located within a batch of machines going for maintenance.

There are a few primitives of interest here:

* Provide a way for operators to fully shut down a slave (killing all tasks
underneath it).
* Provide a way for operators to mark specific slaves as undergoing
maintenance. This means that no more offers are being sent for these slaves,
and no new tasks will launch on them.
* Provide a way for frameworks to be notified when resources are requested to
be relinquished. This gives the framework a chance to proactively move a task
before it is forcibly killed. It also allows the automation of operations like:
please drain these slaves within 1 hour.

Provide cluster maintenance primitives for operators.
-

Key: MESOS-1474
URL: https://issues.apache.org/jira/browse/MESOS-1474
Project: Mesos
Issue Type: Epic
Components: framework, master, slave
Reporter: Benjamin Mahler

Sometimes operators need to perform maintenance on a mesos cluster; we define
maintenance here as anything that requires the tasks to be drained on the
slave(s). Most mesos upgrades can be done without affecting running tasks,
but there are situations where maintenance is task-affecting:
* Host maintenance (e.g. hardware repair, kernel upgrades).
* Non-recoverable slave upgrades (e.g. adjusting slave attributes).
* etc
In order to ensure operators don’t violate frameworks’ SLAs, schedulers need
to be aware of planned unavailability events.
Maintenance awareness allows schedulers to avoid churn for long running tasks
by placing them on machines not undergoing maintenance. If all resources are
planned for maintenance, then the scheduler will prefer machines scheduled
for maintenance least imminently.
Maintenance awareness is also crucial when a scheduler uses [persistent
disk|https://issues.apache.org/jira/browse/MESOS-1554] resources, to ensure
that the scheduler is aware of the expected duration of unavailability for a
persistent disk resource (e.g. using 3 1TB replicas, don’t need to replicate
1TB over the network when only 1 of the 3 replicas is going to

[jira] [Updated] (MESOS-1470) Add operational documentation for running HA masters.

[
https://issues.apache.org/jira/browse/MESOS-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Benjamin Mahler updated MESOS-1470:
---
Description:
Now that the master has replicated state on the disk, we should add
documentation that guides operators through common maintenance work:

* Swapping a master in the ensemble.
* Growing the master ensemble.
* Shrinking the master ensemble.

This would help craft similar documentation for users of the replicated log.

was:
Now that the master has replicated state on the disk, we should add
documentation that guides operators for common maintenance work:

* Swapping a master in the ensemble.
* Growing the master ensemble.
* Shrinking the master ensemble.

This would help craft similar documentation for users of the replicated log.

We should also add documentation for existing slave maintenance documentation:

* Best practices for rolling upgrades.
* How to shut down a slave.

This latter category will be incorporated with [~alexandra.sava]'s maintenance
work!

Affects Version/s: (was: 0.19.0)
Summary: Add operational documentation for running HA masters.
(was: Add cluster maintenance documentation.)

Add operational documentation for running HA masters.
-

Key: MESOS-1470
URL: https://issues.apache.org/jira/browse/MESOS-1470
Project: Mesos
Issue Type: Documentation
Components: documentation
Reporter: Benjamin Mahler
Labels: twitter

Now that the master has replicated state on the disk, we should add
documentation that guides operators through common maintenance work:
* Swapping a master in the ensemble.
* Growing the master ensemble.
* Shrinking the master ensemble.
This would help craft similar documentation for users of the replicated log.


[jira] [Created] (MESOS-2060) Add support for 'hooks' in task launch sequence

2014-11-10 Thread Niklas Quarfot Nielsen (JIRA)

Niklas Quarfot Nielsen created MESOS-2060:
-

 Summary: Add support for 'hooks' in task launch sequence
 Key: MESOS-2060
 URL: https://issues.apache.org/jira/browse/MESOS-2060
 Project: Mesos
  Issue Type: Task
  Components: modules
Reporter: Niklas Quarfot Nielsen


Similar to Apache modules, hooks allow module writers to tie into internal 
components that may not be suitable to be abstracted entirely behind modules, 
and instead let them define actions on so-called hooks 
(http://httpd.apache.org/docs/2.2/developer/hooks.html).
In the Apache web server, this lets people tie into the request processing 
cycle. In Mesos, one interesting place to start could be pre and post actions 
for the master and slave task launch sequence.
Examples could be external statistics/metrics gathering, security 
infrastructure, etc.
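
To make the idea concrete, here is a minimal sketch of pre- and post-launch hooks, written in Python for brevity. Nothing here is a Mesos API: `HookRegistry`, `launch_task`, and the hook signature are hypothetical and only illustrate the pattern.

```python
from typing import Callable, Dict, List

Hook = Callable[[Dict[str, str]], None]


class HookRegistry:
    """Holds callbacks to run before and after a task launch (hypothetical)."""

    def __init__(self) -> None:
        self.pre_launch: List[Hook] = []
        self.post_launch: List[Hook] = []

    def launch_task(self, task: Dict[str, str], launch: Hook) -> None:
        # Run every pre-launch hook, then the launch itself, then post hooks.
        for hook in self.pre_launch:
            hook(task)
        launch(task)
        for hook in self.post_launch:
            hook(task)


events: List[str] = []
registry = HookRegistry()
registry.pre_launch.append(lambda task: events.append("pre:" + task["id"]))
registry.post_launch.append(lambda task: events.append("post:" + task["id"]))
registry.launch_task({"id": "t1"}, lambda task: events.append("launch:" + task["id"]))
print(events)  # ['pre:t1', 'launch:t1', 'post:t1']
```

A metrics-gathering module, for example, would only need to append one callback to each list rather than wrap the whole launch path.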




[jira] [Created] (MESOS-2061) Add InverseOffer protobuf message.

Benjamin Mahler created MESOS-2061:
--

 Summary: Add InverseOffer protobuf message.
 Key: MESOS-2061
 URL: https://issues.apache.org/jira/browse/MESOS-2061
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


InverseOffer was defined as part of the maintenance work in MESOS-1474, design 
doc here: 
https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing

This ticket is to capture the addition of the InverseOffer protobuf to 
mesos.proto. The necessary API changes for Event/Call and the language bindings 
will be tracked separately.




[jira] [Created] (MESOS-2062) Add InverseOffer to Event/Call API.

Benjamin Mahler created MESOS-2062:
--

 Summary: Add InverseOffer to Event/Call API.
 Key: MESOS-2062
 URL: https://issues.apache.org/jira/browse/MESOS-2062
 Project: Mesos
  Issue Type: Task
  Components: c++ api
Reporter: Benjamin Mahler


The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add this is to tack it on to the OFFERS Event:

{code}
message Offers {
  repeated Offer offers = 1;
  repeated InverseOffer inverse_offers = 2;
}
{code}




[jira] [Created] (MESOS-2064) Add InverseOffer to Java Scheduler API.

Benjamin Mahler created MESOS-2064:
--

 Summary: Add InverseOffer to Java Scheduler API.
 Key: MESOS-2064
 URL: https://issues.apache.org/jira/browse/MESOS-2064
 Project: Mesos
  Issue Type: Task
  Components: java api
Reporter: Benjamin Mahler


The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Java Scheduler API is to add a new callback:

{code}
  void inverseResourceOffers(
      SchedulerDriver driver,
      List<InverseOffer> inverseOffers);
{code}

JAR / libmesos compatibility will need to be figured out here.




[jira] [Created] (MESOS-2065) Add InverseOffer to Python Scheduler API.

Benjamin Mahler created MESOS-2065:
--

 Summary: Add InverseOffer to Python Scheduler API.
 Key: MESOS-2065
 URL: https://issues.apache.org/jira/browse/MESOS-2065
 Project: Mesos
  Issue Type: Task
  Components: python api
Reporter: Benjamin Mahler


The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Python Scheduler API is to add a new callback:

{code}
  def inverseResourceOffers(self, driver, inverse_offers):
{code}

Egg / libmesos compatibility will need to be figured out here.
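
A scheduler might implement the proposed callback along these lines. This is a sketch under stated assumptions: the callback does not exist yet, `InverseOffer` is modeled as a plain dataclass rather than the protobuf (field names follow the proposal in MESOS-2061), and `DrainingScheduler` and `migrate` are hypothetical, framework-specific names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InverseOffer:
    # Stand-in for the proposed protobuf; only the fields used below.
    slave_id: str
    task_ids: List[str] = field(default_factory=list)


class DrainingScheduler:
    """Moves tasks away from slaves named in inverse offers (hypothetical)."""

    def __init__(self):
        self.migrated = []

    def migrate(self, task_id, away_from):
        # Framework-specific: a real scheduler would relaunch the task elsewhere.
        self.migrated.append((task_id, away_from))

    def inverseResourceOffers(self, driver, inverse_offers):
        for inverse_offer in inverse_offers:
            for task_id in inverse_offer.task_ids:
                self.migrate(task_id, away_from=inverse_offer.slave_id)


scheduler = DrainingScheduler()
scheduler.inverseResourceOffers(None, [InverseOffer("slave-1", ["t1", "t2"])])
print(scheduler.migrated)  # [('t1', 'slave-1'), ('t2', 'slave-1')]
```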




[jira] [Updated] (MESOS-2064) Add InverseOffer to Java Scheduler API.


 [ 
https://issues.apache.org/jira/browse/MESOS-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2064:
---
Description: 
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Java Scheduler API is to add a new callback:

{code}
  void inverseResourceOffers(
      SchedulerDriver driver,
      List<InverseOffer> inverseOffers);
{code}

JAR / libmesos compatibility will need to be figured out here.

We may want to leave the Java binding untouched in favor of Event/Call, in 
order to not break API compatibility for schedulers.

  was:
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Java Scheduler API is to add a new callback:

{code}
  void inverseResourceOffers(
      SchedulerDriver driver,
      List<InverseOffer> inverseOffers);
{code}

JAR / libmesos compatibility will need to be figured out here.


 Add InverseOffer to Java Scheduler API.
 ---

 Key: MESOS-2064
 URL: https://issues.apache.org/jira/browse/MESOS-2064
 Project: Mesos
  Issue Type: Task
  Components: java api
Reporter: Benjamin Mahler

 The initial use case for InverseOffer in the framework API will be the 
 maintenance primitives in mesos: MESOS-1474.
 One way to add these to the Java Scheduler API is to add a new callback:
 {code}
   void inverseResourceOffers(
       SchedulerDriver driver,
       List<InverseOffer> inverseOffers);
 {code}
 JAR / libmesos compatibility will need to be figured out here.
 We may want to leave the Java binding untouched in favor of Event/Call, in 
 order to not break API compatibility for schedulers.




[jira] [Updated] (MESOS-2063) Add InverseOffer to C++ Scheduler API.


 [ 
https://issues.apache.org/jira/browse/MESOS-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2063:
---
Description: 
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the C++ Scheduler API is to add a new callback:

{code}
  virtual void inverseResourceOffers(
      SchedulerDriver* driver,
      const std::vector<InverseOffer>& inverseOffers) = 0;
{code}

libmesos compatibility will need to be figured out here.

We may want to leave the C++ binding untouched in favor of Event/Call, in order 
to not break API compatibility for schedulers.

  was:
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the C++ Scheduler API is to add a new callback:

{code}
  virtual void inverseResourceOffers(
      SchedulerDriver* driver,
      const std::vector<InverseOffer>& inverseOffers) = 0;
{code}

libmesos compatibility will need to be figured out here.


 Add InverseOffer to C++ Scheduler API.
 --

 Key: MESOS-2063
 URL: https://issues.apache.org/jira/browse/MESOS-2063
 Project: Mesos
  Issue Type: Task
  Components: c++ api
Reporter: Benjamin Mahler

 The initial use case for InverseOffer in the framework API will be the 
 maintenance primitives in mesos: MESOS-1474.
 One way to add these to the C++ Scheduler API is to add a new callback:
 {code}
   virtual void inverseResourceOffers(
       SchedulerDriver* driver,
       const std::vector<InverseOffer>& inverseOffers) = 0;
 {code}
 libmesos compatibility will need to be figured out here.
 We may want to leave the C++ binding untouched in favor of Event/Call, in 
 order to not break API compatibility for schedulers.




[jira] [Updated] (MESOS-2065) Add InverseOffer to Python Scheduler API.


 [ 
https://issues.apache.org/jira/browse/MESOS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2065:
---
Description: 
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Python Scheduler API is to add a new callback:

{code}
  def inverseResourceOffers(self, driver, inverse_offers):
{code}

Egg / libmesos compatibility will need to be figured out here.

We may want to leave the Python binding untouched in favor of Event/Call, in 
order to not break API compatibility for schedulers.

  was:
The initial use case for InverseOffer in the framework API will be the 
maintenance primitives in mesos: MESOS-1474.

One way to add these to the Python Scheduler API is to add a new callback:

{code}
  def inverseResourceOffers(self, driver, inverse_offers):
{code}

Egg / libmesos compatibility will need to be figured out here.


 Add InverseOffer to Python Scheduler API.
 -

 Key: MESOS-2065
 URL: https://issues.apache.org/jira/browse/MESOS-2065
 Project: Mesos
  Issue Type: Task
  Components: python api
Reporter: Benjamin Mahler

 The initial use case for InverseOffer in the framework API will be the 
 maintenance primitives in mesos: MESOS-1474.
 One way to add these to the Python Scheduler API is to add a new callback:
 {code}
   def inverseResourceOffers(self, driver, inverse_offers):
 {code}
 Egg / libmesos compatibility will need to be figured out here.
 We may want to leave the Python binding untouched in favor of Event/Call, in 
 order to not break API compatibility for schedulers.




[jira] [Updated] (MESOS-2061) Add InverseOffer protobuf message.


 [ 
https://issues.apache.org/jira/browse/MESOS-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2061:
---
Description: 
InverseOffer was defined as part of the maintenance work in MESOS-1474, design 
doc here: 
https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing

{code}
// A request to deallocate or return any resources already
// being consumed by the framework.
message InverseOffer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  repeated Resource resources = 3;
 
  // The slave ID if the resources need to be released on a particular slave.
  optional SlaveID slave_id = 4;
  
  // The executor and task IDs if the resources need to be released on specific
  // executors and/or tasks.
  optional ExecutorID executor_id = 5;
  repeated TaskID task_ids = 6;
 
  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks running using these resources might get killed when
  // these resources become unavailable.
  required Unavailability unavailability = 7;
}
{code}

This ticket is to capture the addition of the InverseOffer protobuf to 
mesos.proto. The necessary API changes for Event/Call and the language bindings 
will be tracked separately.

  was:
InverseOffer was defined as part of the maintenance work in MESOS-1474, design 
doc here: 
https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing

This ticket is to capture the addition of the InverseOffer protobuf to 
mesos.proto, the necessary API changes for Event/Call and the language bindings 
will be tracked separately.


 Add InverseOffer protobuf message.
 --

 Key: MESOS-2061
 URL: https://issues.apache.org/jira/browse/MESOS-2061
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler
  Labels: twitter

 InverseOffer was defined as part of the maintenance work in MESOS-1474, 
 design doc here: 
 https://docs.google.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit?usp=sharing
 {code}
 // A request to deallocate or return any resources already
 // being consumed by the framework.
 message InverseOffer {
   required OfferID id = 1;
   required FrameworkID framework_id = 2;
   repeated Resource resources = 3;
  
   // The slave ID if the resources need to be released on a particular slave.
   optional SlaveID slave_id = 4;
   
   // The executor and task IDs if the resources need to be released on specific
   // executors and/or tasks.
   optional ExecutorID executor_id = 5;
   repeated TaskID task_ids = 6;
  
   // The resources specified in this offer will become unavailable
   // at the specified start time and for the specified duration. Any
   // tasks running using these resources might get killed when
   // these resources become unavailable.
   required Unavailability unavailability = 7;
 }
 {code}
 This ticket is to capture the addition of the InverseOffer protobuf to 
 mesos.proto, the necessary API changes for Event/Call and the language 
 bindings will be tracked separately.




[jira] [Updated] (MESOS-2065) Add InverseOffer to Python Scheduler API.


 [ 
https://issues.apache.org/jira/browse/MESOS-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2065:
---
Labels: twitter  (was: )

 Add InverseOffer to Python Scheduler API.
 -

 Key: MESOS-2065
 URL: https://issues.apache.org/jira/browse/MESOS-2065
 Project: Mesos
  Issue Type: Task
  Components: python api
Reporter: Benjamin Mahler
  Labels: twitter

 The initial use case for InverseOffer in the framework API will be the 
 maintenance primitives in mesos: MESOS-1474.
 One way to add these to the Python Scheduler API is to add a new callback:
 {code}
   def inverseResourceOffers(self, driver, inverse_offers):
 {code}
 Egg / libmesos compatibility will need to be figured out here.
 We may want to leave the Python binding untouched in favor of Event/Call, in 
 order to not break API compatibility for schedulers.




[jira] [Created] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.

Benjamin Mahler created MESOS-2066:
--

 Summary: Add optional 'Unavailability' to resource offers to 
provide maintenance awareness.
 Key: MESOS-2066
 URL: https://issues.apache.org/jira/browse/MESOS-2066
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


In order to inform frameworks about upcoming maintenance on offered resources, 
per MESOS-1474, we'd like to add optional 'Unavailability' information to 
offers:

{code}
message Unavailability {
  required Time start = 1;
  // The approximate duration of the unavailability,
  // if this is a transient unavailability.
  optional Duration duration = 2;
}

message Offer {
  required OfferID id = 1;
  required FrameworkID framework_id = 2;
  required SlaveID slave_id = 3;
  required string hostname = 4;
  repeated Resource resources = 5;
  repeated Attribute attributes = 7;
  repeated ExecutorID executor_ids = 6;
 
  // The resources specified in this offer will become unavailable
  // at the specified start time and for the specified duration. Any
  // tasks launched using these resources might get killed when
  // these resources become unavailable.
  optional Unavailability unavailability = 8;
}
{code}
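
One way a framework might consume this field, sketched in Python under stated 
assumptions: the dataclasses below only mirror the proposed messages, Time and 
Duration are modeled as plain seconds (their real representation is not given 
here), and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unavailability:
    start: float                      # Assumed unit: seconds since the epoch.
    duration: Optional[float] = None  # Assumed unit: seconds; None if unknown.


@dataclass
class Offer:
    id: str
    unavailability: Optional[Unavailability] = None


def finishes_before_unavailability(offer, now, expected_runtime):
    """Return True if a task started now should end before the outage."""
    if offer.unavailability is None:
        return True  # No maintenance announced for these resources.
    return now + expected_runtime <= offer.unavailability.start


def unavailable_until(offer):
    """Return the approximate end of the outage, or None if not known."""
    u = offer.unavailability
    if u is None or u.duration is None:
        return None
    return u.start + u.duration
```

A framework could use the first check to decline offers for long-running 
tasks while still placing short tasks on resources that are about to go into 
maintenance.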




[jira] [Updated] (MESOS-2066) Add optional 'Unavailability' to resource offers to provide maintenance awareness.


 [ 
https://issues.apache.org/jira/browse/MESOS-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2066:
---
Labels: twitter  (was: )

 Add optional 'Unavailability' to resource offers to provide maintenance 
 awareness.
 --

 Key: MESOS-2066
 URL: https://issues.apache.org/jira/browse/MESOS-2066
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler
  Labels: twitter





[jira] [Updated] (MESOS-2064) Add InverseOffer to Java Scheduler API.


 [ 
https://issues.apache.org/jira/browse/MESOS-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2064:
---
Labels: twitter  (was: )

 Add InverseOffer to Java Scheduler API.
 ---

 Key: MESOS-2064
 URL: https://issues.apache.org/jira/browse/MESOS-2064
 Project: Mesos
  Issue Type: Task
  Components: java api
Reporter: Benjamin Mahler
  Labels: twitter

 The initial use case for InverseOffer in the framework API will be the 
 maintenance primitives in Mesos: MESOS-1474.
 One way to add these to the Java Scheduler API is to add a new callback:
 {code}
   void inverseResourceOffers(
       SchedulerDriver driver,
       List<InverseOffer> inverseOffers);
 {code}
 JAR / libmesos compatibility will need to be figured out here.
 We may want to leave the Java binding untouched in favor of Event/Call, in 
 order to not break API compatibility for schedulers.




[jira] [Created] (MESOS-2067) Add HTTP API to the master for maintenance operations.