Re: Issues 16 and Issue 12

Swapnil Daingade Tue, 04 Aug 2015 13:53:57 -0700

Hi Darin,

Thank you for taking a look at the documents.
I was wondering if there were any parts that would be better explained with
more details
or if we overlooked any scenarios.

Regarding 1
If I understand correctly, we are already doing task reconciliation for NM
tasks, as described here:
http://mesos.apache.org/documentation/latest/reconciliation/
The code is here:
https://github.com/mesos/myriad/blob/phase1/myriad-scheduler/src/main/java/com/ebay/myriad/scheduler/ReconcileService.java#L34
In the case of FGS, we are planning to move the lifecycle management of the
container tasks to the executor
(NM + executor merge, or RPC between NM and executor, whichever we end up
going with).
We are not storing container tasks in the state store. Do you see any
additional work required for reconciliation?
If we decide to support work-preserving NM restart, we might have to do
some work to start the NM on the same node on which it ran earlier.
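
To make that split concrete, here is a minimal sketch in plain Java with no
Mesos dependency. All names in it (TaskKind, ReconcileSketch, launched,
tasksToReconcile) are hypothetical and are not taken from the Myriad code.
It only illustrates the idea that NM tasks are persisted and therefore
reconciled, while container tasks are left to the executor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical task kinds; illustrative names, not Myriad's. */
enum TaskKind { NODE_MANAGER, CONTAINER }

/** Hypothetical in-memory stand-in for the scheduler's persisted task records. */
class ReconcileSketch {
    // Task id -> kind, for the tasks the scheduler keeps in its state store.
    private final Map<String, TaskKind> stateStore = new LinkedHashMap<>();

    /** Only NM tasks are persisted; container tasks are left to the executor. */
    void launched(String taskId, TaskKind kind) {
        if (kind == TaskKind.NODE_MANAGER) {
            stateStore.put(taskId, kind);
        }
    }

    /** Task ids the scheduler would list in an explicit reconciliation request. */
    List<String> tasksToReconcile() {
        return new ArrayList<>(stateStore.keySet());
    }
}
```

With this shape, a container task launched under FGS never shows up in the
list handed to reconciliation, which matches the plan of not storing
container tasks in the state store.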

Regarding 2
The only RMStateStore API that we override in MyriadFileSystemRMStateStore
is
RMState loadState()
which I believe should not change, or would be trivial to fix if it did.
What I like about the RMStateStore is that you can plug in a new
implementation simply by modifying a property in yarn-site.xml. YARN then
takes care of loading the state store before it starts other services.
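
As a rough sketch of that switch, the yarn-site.xml fragment might look like
the following. The two property names are the standard YARN settings for RM
recovery; the fully qualified class name is an assumption about where
MyriadFileSystemRMStateStore lives and should be checked against the PR.

```xml
<!-- Enable RM recovery so YARN loads a state store on startup. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Select the state store implementation.
     The package below is an assumption; verify it against the Myriad source. -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.MyriadFileSystemRMStateStore</value>
</property>
```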

Regarding 3
I think we should have the flexibility to plug in multiple implementations
for storing Myriad state (including one that uses the Mesos State API).
Let me look through the PR to see if any of the code makes assumptions
about using RMStateStore (a YARN-specific interface).
If so, I'll change it to use MyriadStateStore (a generic interface I
introduced that every Myriad state store implementation could implement).
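
A minimal sketch of what coding against a generic interface could look like.
The interface and method names here (MyriadStateStoreSketch,
storeMyriadState, loadMyriadState) and the in-memory implementation are
hypothetical and only illustrate the idea; the actual MyriadStateStore in
the PR may look different.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical shape of a generic Myriad state store; method names are assumptions. */
interface MyriadStateStoreSketch {
    void storeMyriadState(byte[] serializedState);
    byte[] loadMyriadState();   // returns null if nothing has been stored yet
}

/** In-memory implementation, for illustration only. */
class InMemoryStateStore implements MyriadStateStoreSketch {
    private byte[] state;

    @Override
    public void storeMyriadState(byte[] serializedState) {
        // Defensive copy so later changes by the caller do not alter stored state.
        state = Arrays.copyOf(serializedState, serializedState.length);
    }

    @Override
    public byte[] loadMyriadState() {
        return state == null ? null : Arrays.copyOf(state, state.length);
    }
}

/** Scheduler-side code that depends only on the generic interface. */
class SchedulerStateSketch {
    private final MyriadStateStoreSketch store;

    SchedulerStateSketch(MyriadStateStoreSketch store) {
        this.store = store;
    }

    void checkpoint(String frameworkId) {
        store.storeMyriadState(frameworkId.getBytes(StandardCharsets.UTF_8));
    }

    String recover() {
        byte[] bytes = store.loadMyriadState();
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }
}
```

An RMStateStore-backed store or a Mesos State API-backed store would then be
just another implementation passed to the same constructor, with no change
to the scheduler-side code.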

Regards
Swapnil

On Mon, Aug 3, 2015 at 5:03 PM, Darin Johnson <[email protected]>
wrote:

> Swapnil,
>
> Looked over both Docs, HA and NM restart.  It's pretty high level so I'll
> look forward to the details.  Initial thoughts:
>
> 1. Getting framework reconciliation going would likely eliminate certain
> issues, such as sendFrameworkMessage being unreliable.  So it should be
> implemented sooner rather than later.
>
> 2. How stable is the RMStateStore API? If there are changes between
> versions of Hadoop, it might be best to use Mesos's State API.
>
> 3. There was no mention of running two RMs in traditional Hadoop RM HA
> (maybe even in Marathon), but this should be considered a possibility. That
> may have been implicit.
>
> Saw the PR; will look at it.
>
> Darin
> Hi Darin,
>
> The Myriad HA work will involve work related to issue 16.
> I already have the Myriad HA design doc for review.
> Your feedback on it would be really helpful.
> I also plan to send out for review parts of the Myriad HA implementation
> (although it does not address task reconciliation yet). I was planning to
> work on it next.
>
> Regards
> Swapnil
>
>
> On Mon, Aug 3, 2015 at 12:08 PM, Darin Johnson <[email protected]>
> wrote:
>
> > Is anyone actively working on these?  I'm interested in both of these and
> > should have some cycles to work on them.
> >
> > One question I have on issue 12 is how to generalize scheduling policies
> > if we have autoscaling, fine-grained scheduling, and fixed resources (with
> > a flexup/flexdown option).  Currently it seems as though FGS is embedded
> > pretty deeply.  Ideally, though, we could have a SchedulerPolicy interface,
> > and users could specify the SchedulerPolicy via the Myriad config.
> >
> > If I don't get a response, I'll probably start issue 16 as it's
> > straightforward and write something up on 12.
> >
> > Darin
> >
>
