> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.hpp
> > Lines 889 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136016#file2136016line889>
> >
> >     Why do we need this hashmap? I see that the code inserts/removes 
> > entries, but nothing ever gets anything from it.

Yea we can nuke this.


> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.cpp
> > Lines 4439-4443 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136017#file2136017line4439>
> >
> >     Should we also unschedule the garbage collection of the framework work 
> > directory?
> >     
> >     It gets scheduled for GC here: 
> > https://github.com/apache/mesos/blob/1acb38c27306326a53f866b5386b5e28a6dc9314/src/slave/slave.cpp#L6834-L6838

My thought was that we don't need to unschedule the work dir because, in the 
case of non-launch operations, we won't be creating any executor/task 
directories in the framework's work dir. However, the following sequence seems 
possible:
1) All of a framework's tasks on an agent go terminal, so the framework is 
removed and its meta/work dirs are scheduled for GC
2) The framework submits a non-launch operation on this agent, and we create a 
new framework in this patch but do NOT unschedule GC on the framework's work dir
3) The framework launches a task on this agent, we create a new executor dir in 
it's work dir, and then later on the framework's work dir is incorrectly GC'd 
because we never unscheduled it in #2

So, I think you're right. We should unschedule GC on the framework's work dir.


> On April 2, 2019, 12:50 a.m., Gastón Kleiman wrote:
> > src/slave/slave.cpp
> > Lines 4459 (patched)
> > <https://reviews.apache.org/r/70335/diff/1/?file=2136017#file2136017line4459>
> >
> >     I noticed that `Slave::run()` only checkpoints the framework if 
> > `frameworkInfo.checkpoint()` returns `true`.
> >     
> >     Should we maybe not checkpoint operations/frameworkInfos of frameworks 
> > that set `checkpoint` to `false`?

I think the case of operations is different. Since non-launch operations are 
not coupled to the framework lifecycle the way tasks are, we want to retain the 
FrameworkInfo even if the framework does not have checkpointing enabled. For 
example, a framework could submit a CREATE_DISK operation which takes a long 
time, then immediately tear itself down. If checkpointing is not enabled, we 
would still want to have the FrameworkInfo checkpointed so that if the master 
fails over and the agent reregisters, we can create the Framework struct in the 
master and track the operation.


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70335/#review214247
-----------------------------------------------------------


On March 28, 2019, 11:27 p.m., Greg Mann wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70335/
> -----------------------------------------------------------
> 
> (Updated March 28, 2019, 11:27 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, Chun-Hung Hsiao, 
> Gastón Kleiman, Joseph Wu, and Meng Zhu.
> 
> 
> Bugs: MESOS-8582
>     https://issues.apache.org/jira/browse/MESOS-8582
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch updates the master to send a framework's full
> `FrameworkInfo` to the agent in the `ApplyOperationMessage`.
> The agent is updated to checkpoint frameworks when applying
> operations, and to unschedule GC on the meta directory when
> a new framework is created.
> 
> The test `TerminalOrphanOperationAfterMasterFailover` is
> removed since this patch eliminates the case of orphan
> operations relevant to that test.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp acc67d3763ddee9027e6cf375f1d495ff5805026 
>   src/messages/messages.proto e30ad34cc9212b05f85ba5e1d4fcfc9e49ae92c0 
>   src/slave/slave.hpp 2bffdc4b1e282d3c6dae2dcf23584a8a4550bf94 
>   src/slave/slave.cpp 5373cee5d30c2403497939eeba2ee5405117237e 
>   src/tests/persistent_volume_tests.cpp 
> 7e929a5a3a92e16a5dec10206f37caebc20d66a8 
>   src/tests/reservation_tests.cpp cd84cd24d3587fafc01ae1861f22c47262f2d7e9 
>   src/tests/storage_local_resource_provider_tests.cpp 
> 55c9389453754227de31e0a76d32ba663cc8ca7c 
> 
> 
> Diff: https://reviews.apache.org/r/70335/diff/1/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Greg Mann
> 
>

Reply via email to