> On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote:
> > src/master/master.cpp
> > Lines 7103-7120 (patched)
> > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7103>
> >
> >     I feel that long term, this way of removing all and then add back will 
> > probably not work.
> >     
> >     Removing offer operation means we'll need to send status update (if the 
> > current state is not terminal). I'd suggest we only remove those that are 
> > not in the new list, and add those that are not in the old list.
> >     
> >     Same comments apply to the agent chagne.

I changed the code to only handle removed and added operations, but leave known 
operations alone so they can transition their states by observing relayed offer 
operation status updates.

Also in the agent.


> On Nov. 16, 2017, 2:05 a.m., Jie Yu wrote:
> > src/master/master.cpp
> > Lines 7108 (patched)
> > <https://reviews.apache.org/r/63732/diff/2/?file=1893485#file1893485line7108>
> >
> >     Think about the case where agent crashes and restarts, not all RP has 
> > re-registered yet. In that case, some operation from some not yet 
> > re-registered RP will not be part of this operation list.
> >     
> >     I don't think we want to remove those operations just yet. I think we 
> > should remove those operations only if the corresponding RP has 
> > re-registered with the agent and show up in the offer operation list (or 
> > total resources).
> >     
> >     This is similar to we don't remove tasks when agent disconnects. In 
> > fact, you should follow the similar patter in reconcileKnownSlave here. If 
> > an operation is unknown, instead of calling `removeOfferOperation` 
> > directly, we should probably send a `reconcileOfferOperationMessage` to the 
> > agent to ask the RP to generate a status update, and rely on status update 
> > handler to properly handle the resource accounting.
> >     
> >     Also realized that we probably should also do the same in the agent 
> > code. Instead of directly calling removeOfferOperation and 
> > addOfferOperation, send a `RECONCILE` message to RP to asking the RP to 
> > generate a status udpate. If it's unknown to the RP, RP will send 
> > OFFER_OPERATION_DROPPED, which is terminal.

I am not sure it should be the master's responsibility to keep these operations 
alive since it e.g., has no visibility into whether or when a resource provider 
might come back. It seems to me that we should instead let the agent decide 
when to prune operations from removed resource providers from the list of 
operations processed here. While it currently would not send operations for not 
yet resubscribed RPs after agent failover, these would appear together with the 
RP resources once it resubscribes, so I think the difference here is purely 
semantic. For the case of RP failover the agent will send operations from 
disconnected RPs with the list processed here.

In the end I feel the scenario of agent failover's task analogue is rather a 
master failover with some agent not yet registered. There we would not be able 
to provide information about tasks running on disconnected agents.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63732/#review191135
-----------------------------------------------------------


On Nov. 15, 2017, 6:31 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63732/
> -----------------------------------------------------------
> 
> (Updated Nov. 15, 2017, 6:31 p.m.)
> 
> 
> Review request for mesos, Jie Yu and Jan Schlicht.
> 
> 
> Bugs: MESOS-8207
>     https://issues.apache.org/jira/browse/MESOS-8207
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 59a533940736f5cfd5ec31e0ed924f0b2ab13f9c 
> 
> 
> Diff: https://reviews.apache.org/r/63732/diff/2/
> 
> 
> Testing
> -------
> 
> `make check`, still need to implement dedicated tests.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>

Reply via email to