I don't want to be negative; in concept, the idea has merit. That said, I
am extremely concerned about performance. If there is an x% performance
hit with this approach, and there is another method that may take more
work but not carry the performance hit, I think we should focus on that. I
understand there may be "smallish" applications for which this may work;
however, I see a danger of scale: while it may work at small scale in
dev/testing, someone who tries this approach and then TRIES to scale may
be severely disappointed.

On Wed, May 20, 2015 at 8:49 PM, Swapnil Daingade <
[email protected]> wrote:

> Trying to send image again. This time as attachment.
>
> Regards
> Swapnil
>
>
> On Wed, May 20, 2015 at 5:43 PM, Swapnil Daingade <
> [email protected]> wrote:
>
>> Hi John,
>>
>> Are you suggesting something like this ?
>>
>> In issue 96 we are proposing something that will not require port mapping.
>> Can you take a look and give us your thoughts?
>> https://github.com/mesos/myriad/issues/96
>>
>> Regards
>> Swapnil
>>
>>
>> On Fri, May 15, 2015 at 6:44 AM, John Omernik <[email protected]> wrote:
>>
>>> This is true. In this setup, though, we wouldn't be using the "random
>>> ports." We'd be assigning the ports that will be used by the RM (the 5)
>>> per cluster (with config changes) ahead of time. That is what the RM
>>> would know as its ports. At this point, when Marathon spins up an RM,
>>> HAProxy would take the service ports (which would be the same ports the
>>> RM "thinks" it is running on) and forward them to the ports that Mesos
>>> has proxied (in the available ports list). I've done this in Docker, but
>>> not on native Marathon-run processes. I need to look into that more.
>>>
>>> One concern I have with HAProxy is long-running TCP connections (I am
>>> not sure if this applies to YARN/RM). Basically, in one particular use
>>> case, running a Hive Thrift (HiveServer2) service in Docker on the Mesos
>>> cluster with HAProxy, I found that if I submitted a long query, the
>>> query would be submitted, but HAProxy would not see traffic for a while
>>> and would kill the proxied connection to the backend. This was annoying,
>>> to say the least. Would this occur here as well? I really think that if
>>> the haproxy-marathon-bridge is used, we'd have to be certain that
>>> condition can't occur, even silently. (I would hate for that condition
>>> to occur, have YARN "reset" without error, adding a bit of latency to
>>> the process, and have that go unaddressed.)
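>>> For what it's worth, the symptom I saw is usually addressed with longer
>>> idle timeouts and TCP keepalives. A minimal haproxy.cfg sketch (the
>>> backend "node3:31005" is just a placeholder, not anything the bridge
>>> generates):
>>>
>>>   defaults
>>>     mode tcp
>>>     timeout client 1h   # tolerate idle client connections (long queries)
>>>     timeout server 1h   # tolerate idle backend connections
>>>
>>>   listen hiveserver2
>>>     bind *:10000
>>>     option clitcpka     # TCP keepalives toward the client
>>>     option srvtcpka     # TCP keepalives toward the backend
>>>     server hs2 node3:31005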
>>>
>>> So other than the HAProxy weirdness I saw, that approach could work, and
>>> then mesos-dns is just a nice convenience for administrators and users.
>>> What do I mean by that?
>>>
>>> Well, let's say you have a cluster of node1, node2, node3, and node4.
>>>
>>> You assign the 5 YARN ports (and service ports) for that cluster to be
>>> 15000, 15001, 15002, 15003, and 15004.
>>>
>>> Myriad starts a node manager. It sets in the RM config (and all NM
>>> configs) the ports based on the 5 above.
>>>
>>> Mesos grabs 5 random ports in its allowed range (default 30000 to 31000).
>>>
>>> When Mesos starts the RM process, let's say it starts it on node2.
>>>
>>> Node2 now has ports 30000, 30001, 30002, 30003, and 30004 listening and
>>> is forwarding those to 15000, 15001, 15002, 15003, and 15004 on the
>>> listening process. (Note: I know this is doable with Docker-contained
>>> processes; can Marathon do it outside of Docker?)
>>>
>>> Now HAProxy's config is updated: on EVERY node, the ports 15000-15004
>>> are listening and are forwarding to node2 on ports 30000-30004.
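>>> Concretely, the generated HAProxy config on every node would amount to
>>> something like the following (one listen block per port; "node2" and the
>>> port pairs are just the values from this example):
>>>
>>>   listen yarn_rm_15000
>>>     bind *:15000
>>>     mode tcp
>>>     server rm node2:30000
>>>
>>>   listen yarn_rm_15001
>>>     bind *:15001
>>>     mode tcp
>>>     server rm node2:30001
>>>
>>>   # ...and likewise for 15002-15004 -> 30002-30004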
>>>
>>> To your point on "needing" mesos-dns: technically no, we don't need it.
>>> We can tell our NMs to connect to any node on ports 15000-15004, and
>>> this will work. But we may get added latency (rack-to-rack forwarding,
>>> extra hops).
>>>
>>> Instead, if we set the NMs to connect to myriad-dev-1.marathon.mesos, it
>>> could return an IP that is THE node it's running on. That way we get the
>>> advantage of having the NMs connect to the box with the process. HAProxy
>>> takes the requests and sends them to the Mesos-assigned ports
>>> (30000-30004), which are then forwarded to the process on ports
>>> 15000-15004.
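>>> In yarn-site.xml terms, the NM side would then just be something like
>>> this (using the mesos-dns name from this example and the 15001
>>> resource-tracker service port we picked above):
>>>
>>>   <property>
>>>     <name>yarn.resourcemanager.hostname</name>
>>>     <value>myriad-dev-1.marathon.mesos</value>
>>>   </property>
>>>   <property>
>>>     <name>yarn.resourcemanager.resource-tracker.address</name>
>>>     <value>myriad-dev-1.marathon.mesos:15001</value>
>>>   </property>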
>>>
>>> So without mesos-dns, you just connect to any node on the service ports
>>> and it "works." But when it comes to self-documentation, connecting to
>>> myriad-dev-1.marathon.mesos seems more descriptive than saying the RM is
>>> on node2.yourdomain. Especially when it's not... potential for
>>> administrative confusion.
>>>
>>> With mesos-dns, you connect to the descriptive name, and it works. But
>>> then, given my concerns with HAProxy, do we even NEED it? All HAProxy is
>>> doing at that point is opening a port on a node and sending traffic to
>>> another Mesos-approved port, only to send it to the same port the
>>> process is listening on. Are we adding complexity?
>>>
>>> This is a great discussion as it speaks to some intrinsic challenges that
>>> exist in data center OSes :)
>>>
>>>
>>>
>>>
>>> On Thu, May 14, 2015 at 1:50 PM, Santosh Marella <[email protected]>
>>> wrote:
>>>
>>> > I might be missing something, but I didn't understand why mesos-dns
>>> > would be required in addition to HAProxy. If we configure RM to bind
>>> > to random ports, but have RM reachable via HAProxy on RM's service
>>> > ports, won't all the clients (such as NMs/HiveServer2, etc.) just use
>>> > HAProxy to reach the RM? If yes, why is mesos-dns needed?
>>> >
>>> > I have very limited knowledge about HAProxy configuration in a Mesos
>>> > cluster. I just read through this doc:
>>> > https://docs.mesosphere.com/getting-started/service-discovery/ and
>>> > what I inferred is that an HAProxy instance runs on every slave node,
>>> > and if an NM running on a slave node has to reach the RM, it would
>>> > simply use an RM address that looks like "localhost:99999" (where
>>> > 99999 is an admin-identified RPC service port for the RM).
>>> > Since HAProxy on the NM's localhost listens on 99999, it just forwards
>>> > the traffic to the RM's IP:RandomPort. Am I understanding this
>>> > correctly?
>>> >
>>> > Thanks,
>>> > Santosh
>>> >
>>> > On Tue, May 12, 2015 at 5:41 AM, John Omernik <[email protected]>
>>> > wrote:
>>> >
>>> > > The challenge I think is the ports. So we have 5 ports that are
>>> > > needed for an RM; do we predefine those? I think Yuliya is saying
>>> > > yes, we should. An interesting compromise... rather than truly
>>> > > random ports, when we define a YARN cluster, we have the
>>> > > responsibility to define our 5 "service" ports using the
>>> > > Marathon/HAProxy service ports. (This now requires HAProxy as well
>>> > > as mesos-dns. I'd recommend some work being done on documenting
>>> > > HAProxy for use with the haproxy script; I know that I stumbled a
>>> > > bit trying to get HAProxy set up, but that may just be my own lack
>>> > > of knowledge on the subject.) These ports will have to be available
>>> > > across the cluster, and will map to whichever ports Mesos assigns to
>>> > > the RM.
>>> > >
>>> > > This makes sense to me. A "YARN cluster creation" event on a Mesos
>>> > > cluster is something we want to be flexible, but it's not something
>>> > > that will likely be "self service," i.e., we won't have users just
>>> > > creating YARN clusters at will. It will likely be something where,
>>> > > when requested, the admin can identify 5 available service ports and
>>> > > lock those into that cluster... that way, when the YARN RM spins up,
>>> > > it has its service ports defined (and thus the node managers always
>>> > > know which ports to connect to). Combined with mesos-dns, this could
>>> > > actually work out very well, as the name of the RM can be
>>> > > hard-coded, and the ports will just work no matter which node it
>>> > > spins up on.
>>> > >
>>> > > From an HA perspective, the only advantage at this point of
>>> > > preallocating the failover RM is speed of recovery (and a guarantee
>>> > > of resources being available if failover occurs). Perhaps we could
>>> > > consider this as an option for those who need fast or guaranteed
>>> > > recovery, but not make it a requirement?
>>> > >
>>> > > The service port method will not work, however, for the node
>>> > > manager ports. That said, I "believe" that as Myriad spins up a node
>>> > > manager, it can dynamically allocate the ports and thus report those
>>> > > to the resource manager on registration. Someone may need to help me
>>> > > out on that one, as I am not sure. Also, since the node manager is
>>> > > host-specific, mesos-dns is not required; it can register to the
>>> > > resource manager with whatever ports are allocated and the hostname
>>> > > it's running on. I guess the question here is: when Myriad requests
>>> > > the resources and Mesos allocates the ports, can Myriad, prior to
>>> > > actually starting the node manager, update the configs with the
>>> > > allocated ports? Or is this even needed?
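>>> > > If it helps, one way I could imagine this working (purely a sketch;
>>> > > the property names "nm.webapp.port"/"nm.localizer.port" are made up
>>> > > here, not anything Myriad defines today) is for the executor to pass
>>> > > the Mesos-offered ports in as JVM system properties at launch:
>>> > >
>>> > >   # 31002/31003 stand in for whatever ports Mesos offered
>>> > >   export YARN_NODEMANAGER_OPTS="-Dnm.webapp.port=31002 -Dnm.localizer.port=31003"
>>> > >   yarn nodemanager
>>> > >
>>> > > with yarn-site.xml referencing ${nm.webapp.port} etc., so each NM
>>> > > picks up its own ports without a per-node config file.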
>>> > >
>>> > > This is a great discussion.
>>> > >
>>> > > On Mon, May 11, 2015 at 9:58 PM, yuliya Feldman
>>> > > <[email protected]> wrote:
>>> > >
>>> > > > As far as I understand, in this case Apache YARN RM HA will kick
>>> > > > in, which means the ids, hosts, and ports for all RMs will need to
>>> > > > be defined somewhere, and I wonder how they will be defined in
>>> > > > this situation, since those need to be either in yarn-site.xml or
>>> > > > passed using "-D".
>>> > > > In the case of Mesos-DNS usage, there is no need to set up RM HA
>>> > > > at all, and no warm standby is needed. Marathon will start the RM
>>> > > > somewhere in case of failure, and clients will rediscover it based
>>> > > > on the same hostname.
>>> > > > Am I missing anything?
>>> > > >
>>> > > > From: Adam Bordelon <[email protected]>
>>> > > > To: [email protected]
>>> > > > Sent: Monday, May 11, 2015 7:26 PM
>>> > > > Subject: Re: Recommending or requiring mesos dns?
>>> > > >
>>> > > > I'm a +1 for random ports. You can also use Marathon's servicePort
>>> > > > field to let HAProxy redirect from the servicePort to the actual
>>> > > > hostPort for the service on each node. Mesos-DNS will similarly
>>> > > > direct you to the correct host:port given the appropriate task
>>> > > > name.
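>>> > > > For example, a Marathon app definition along these lines (the id,
>>> > > > command, and resource numbers are placeholders) would pin the five
>>> > > > service ports while leaving the host ports random:
>>> > > >
>>> > > >   {
>>> > > >     "id": "/myriad/rm-dev",
>>> > > >     "cmd": "yarn resourcemanager",
>>> > > >     "cpus": 2,
>>> > > >     "mem": 2048,
>>> > > >     "instances": 1,
>>> > > >     "ports": [15000, 15001, 15002, 15003, 15004]
>>> > > >   }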
>>> > > >
>>> > > > Is there a reason we can't just have Marathon launch two RM tasks
>>> > > > for the same YARN cluster? One would be the leader, and the other
>>> > > > would redirect to it until failover. Once one fails over, the
>>> > > > other will start taking traffic, and Marathon will try to launch a
>>> > > > new backup RM when the resources are available. If the YARN RM
>>> > > > cannot provide us this functionality on its own, perhaps we can
>>> > > > write a simple wrapper script for it.
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Fri, May 8, 2015 at 11:57 AM, John Omernik <[email protected]>
>>> > > > wrote:
>>> > > >
>>> > > > > I would advocate random ports, because there should not be a
>>> > > > > limitation of running only one RM per node. If we want true
>>> > > > > portability, there should be the ability to have the RM for the
>>> > > > > cluster YarnProd run on node1 and also have the RM for the
>>> > > > > cluster YarnDev running on node1 (if it so happens to land this
>>> > > > > way). That way the number of clusters isn't limited by the
>>> > > > > number of physical nodes.
>>> > > > >
>>> > > > > On Fri, May 8, 2015 at 1:33 PM, Santosh Marella
>>> > > > > <[email protected]> wrote:
>>> > > > >
>>> > > > > > RM can store its data either in HDFS or in ZooKeeper. The data
>>> > > > > > store is configurable. There is a config property in YARN
>>> > > > > > (yarn.resourcemanager.recovery.enabled) that tells the RM
>>> > > > > > whether it should try to recover the metadata about the
>>> > > > > > previously submitted apps, the containers allocated to them,
>>> > > > > > etc., from the state store.
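>>> > > > > > For reference, a typical ZooKeeper-backed setup looks roughly
>>> > > > > > like this in yarn-site.xml (the zk quorum addresses are
>>> > > > > > placeholders):
>>> > > > > >
>>> > > > > >   <property>
>>> > > > > >     <name>yarn.resourcemanager.recovery.enabled</name>
>>> > > > > >     <value>true</value>
>>> > > > > >   </property>
>>> > > > > >   <property>
>>> > > > > >     <name>yarn.resourcemanager.store.class</name>
>>> > > > > >     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
>>> > > > > >   </property>
>>> > > > > >   <property>
>>> > > > > >     <name>yarn.resourcemanager.zk-address</name>
>>> > > > > >     <value>zk1:2181,zk2:2181,zk3:2181</value>
>>> > > > > >   </property>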
>>> > > > > >
>>> > > > > > Pre-allocation of a backup RM is a great idea. Thinking about
>>> > > > > > it a bit more, I felt it might be better to have such an
>>> > > > > > option available in Marathon rather than building it in Myriad
>>> > > > > > (and in all frameworks/services that want HA/failover).
>>> > > > > >
>>> > > > > > Let's say we launch a service X via Marathon that requires
>>> > > > > > some resources (cpus/mem/ports), and we want 1 instance of
>>> > > > > > that service to always be available. Marathon promises to
>>> > > > > > restart the service if it goes down. But, as far as I
>>> > > > > > understand, Marathon can restart the service on another node
>>> > > > > > only if the resources required by service X are available on
>>> > > > > > that node *after* the service goes down. In other words,
>>> > > > > > Marathon doesn't proactively "reserve" these resources on
>>> > > > > > another node as a backup for failover.
>>> > > > > >
>>> > > > > > Again, not all services launched via Marathon require this,
>>> > > > > > but perhaps there should be a config option to specify whether
>>> > > > > > a service wants Marathon to keep a backup node ready to go in
>>> > > > > > the event of failure.
>>> > > > > >
>>> > > > > >
>>> > > > > > On Thu, May 7, 2015 at 4:12 PM, John Omernik
>>> > > > > > <[email protected]> wrote:
>>> > > > > >
>>> > > > > > > So I may be looking at this wrong, but where is the data for
>>> > > > > > > the RM stored if it does fail over? How will it know to pick
>>> > > > > > > up where it left off? This is just one area I am low in
>>> > > > > > > understanding on.
>>> > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > > > > That said, what about pre-allocating a second failover RM
>>> > > > > > > somewhere on the cluster? (I am just tossing an idea out
>>> > > > > > > here, in that there are probably many reasons not to do
>>> > > > > > > this.) But here is how I could see it happening.
>>> > > > > > >
>>> > > > > > > 1. Myriad starts an RM, asking for 5 random available ports.
>>> > > > > > > Mesos replies, starting the RM, and reports to Myriad the 5
>>> > > > > > > ports used for the services you listed below.
>>> > > > > > >
>>> > > > > > > 2. Myriad then checks a config value for the number of "hot
>>> > > > > > > spares"; let's say we specify 1. Myriad then puts in a
>>> > > > > > > resource request to Mesos for the CPU and memory required for
>>> > > > > > > the RM, but specifically asks for the same 5 ports allocated
>>> > > > > > > to the first. Basically, it reserves a spot on another node
>>> > > > > > > with the same ports available. It may take a bit, but there
>>> > > > > > > should be that availability. Until this request is met, the
>>> > > > > > > YARN cluster is in an HA-compromised position.
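>>> > > > > > > (For illustration, in Mesos's textual resource notation the
>>> > > > > > > hot-spare request would amount to something like
>>> > > > > > >
>>> > > > > > >   cpus:2;mem:2048;ports:[15000-15004]
>>> > > > > > >
>>> > > > > > > where the cpus/mem values are placeholders and only the exact
>>> > > > > > > port range is the hard constraint.)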
>>> > > > > > >
>>> > > > > >
>>> > > > > >    This is exactly what I think we should do, but why use
>>> > > > > > random ports instead of standard RM ports? If you have 10
>>> > > > > > slave nodes in your Mesos cluster, then there are 10 potential
>>> > > > > > spots for RM to be launched on. However, if you choose to
>>> > > > > > launch multiple RMs (multiple YARN clusters), then you can
>>> > > > > > probably launch at most 5 (with the remaining 5 nodes available
>>> > > > > > for failover).
>>> > > > > >
>>> > > > > > >
>>> > > > > > > 3. At this point, perhaps we start another instance of the RM
>>> > > > > > > right away (depends on my first question about where the RM
>>> > > > > > > stores info about jobs/applications), or the framework just
>>> > > > > > > holds the spot, waiting for a lack of heartbeat (failover
>>> > > > > > > condition) on the primary resource manager.
>>> > > > > > >
>>> > > > > > > 4. If we can run the spare with no issues, it's a simple
>>> > > > > > > update of the DNS record, and node managers connect to the
>>> > > > > > > new RM (and another RM is preallocated for redundancy). If we
>>> > > > > > > can't actually execute the secondary RM until failover
>>> > > > > > > occurs, we can now execute the new RM, and the ports will be
>>> > > > > > > the same.
>>> > > > > > >
>>> > > > > > > This may seem kludgey at first, but done correctly, it may
>>> > > > > > > actually limit the length of failover time, as the RM is
>>> > > > > > > preallocated. RMs are not huge from a resource perspective,
>>> > > > > > > so it may be a small cost for those who want failover and
>>> > > > > > > multiple clusters (thus having dynamic ports).
>>> > > > > > >
>>> > > > > > > I will keep thinking this through, and would welcome
>>> > > > > > > feedback.
>>> > > > > > >
>>> > > > > > > On Thursday, May 7, 2015, Santosh Marella
>>> > > > > > > <[email protected]> wrote:
>>> > > > > > >
>>> > > > > > > > Hi John,
>>> > > > > > > >
>>> > > > > > > > Great views about extending mesos-dns for RM's discovery.
>>> > > > > > > > Some thoughts:
>>> > > > > > > >    1. There are 5 primary interfaces RM exposes that are
>>> > > > > > > > bound to standard ports (the corresponding yarn-site.xml
>>> > > > > > > > properties are sketched below):
>>> > > > > > > >        a. RPC interface for clients that want to submit
>>> > > > > > > > applications to YARN (port 8032).
>>> > > > > > > >        b. RPC interface for NMs to connect back/HB to RM
>>> > > > > > > > (port 8031).
>>> > > > > > > >        c. RPC interface for App Masters to connect back/HB
>>> > > > > > > > to RM (port 8030).
>>> > > > > > > >        d. RPC interface for admin to interact with RM via
>>> > > > > > > > CLI (port 8033).
>>> > > > > > > >        e. Web interface for RM's UI (port 8088).
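>>> > > > > > > > For reference, these map to the following yarn-site.xml
>>> > > > > > > > properties ("rmhost" below is just a placeholder for
>>> > > > > > > > however the RM host ends up being addressed):
>>> > > > > > > >
>>> > > > > > > >   yarn.resourcemanager.address                   rmhost:8032
>>> > > > > > > >   yarn.resourcemanager.resource-tracker.address  rmhost:8031
>>> > > > > > > >   yarn.resourcemanager.scheduler.address         rmhost:8030
>>> > > > > > > >   yarn.resourcemanager.admin.address             rmhost:8033
>>> > > > > > > >   yarn.resourcemanager.webapp.address            rmhost:8088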
>>> > > > > > > >    2. When we launch RM using Marathon, it's probably
>>> > > > > > > > better to mention in Marathon's config that RM will use the
>>> > > > > > > > above ports. This is because, if RM listens on random ports
>>> > > > > > > > (as opposed to the above-listed standard ports), when RM
>>> > > > > > > > fails over, the new RM gets ports that might be different
>>> > > > > > > > from the ones used by the old RM. This makes the RM's
>>> > > > > > > > discovery hard, especially post-failover.
>>> > > > > > > >    3. It looks like what you are proposing is a way to
>>> > > > > > > > update mesos-dns as to what ports RM's services are
>>> > > > > > > > listening on, and when RM fails over, these ports would get
>>> > > > > > > > updated in mesos-dns. Is my understanding correct? If yes,
>>> > > > > > > > one challenge I see is that the clients that want to
>>> > > > > > > > connect to the above-listed RM interfaces also need to pull
>>> > > > > > > > the changes to RM's port numbers from mesos-dns
>>> > > > > > > > dynamically. Not sure how that might be possible.
>>> > > > > > > >
>>> > > > > > > > Regarding your question about NM ports:
>>> > > > > > > >    1. NM has the following ports:
>>> > > > > > > >        a. RPC port for app masters to launch containers
>>> > > > > > > > (this is a random port).
>>> > > > > > > >        b. RPC port for the localization service (port 8040).
>>> > > > > > > >        c. Web port for NM's UI (port 8042).
>>> > > > > > > >    2. Ports (a) and (c) are relayed to RM when NM registers
>>> > > > > > > > with RM. Port (b) is passed to a local container executor
>>> > > > > > > > process via command line args.
>>> > > > > > > >    3. As you rightly reckon, we need a mechanism at launch
>>> > > > > > > > of NM to pass the Mesos-allocated ports to NM for the above
>>> > > > > > > > interfaces. We can try to use the variable expansion
>>> > > > > > > > mechanism Hadoop has (
>>> > > > > > > > http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/conf/Configuration.html
>>> > > > > > > > ) to achieve this.
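>>> > > > > > > > A rough sketch of what that could look like (the property
>>> > > > > > > > names "nm.host"/"nm.webapp.port" are invented placeholders;
>>> > > > > > > > Configuration expands ${...} against other config entries
>>> > > > > > > > and JVM system properties):
>>> > > > > > > >
>>> > > > > > > >   <property>
>>> > > > > > > >     <name>yarn.nodemanager.webapp.address</name>
>>> > > > > > > >     <value>${nm.host}:${nm.webapp.port}</value>
>>> > > > > > > >   </property>
>>> > > > > > > >
>>> > > > > > > > so launching the NM with -Dnm.host=node2 -Dnm.webapp.port=31002
>>> > > > > > > > would fill in the Mesos-assigned values per task.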
>>> > > > > > > >
>>> > > > > > > > Thanks,
>>> > > > > > > > Santosh
>>> > > > > > > >
>>> > > > > > > > On Thu, May 7, 2015 at 3:51 AM, John Omernik
>>> > > > > > > > <[email protected]> wrote:
>>> > > > > > > >
>>> > > > > > > > > I've implemented mesos-dns and use Marathon to launch my
>>> > > > > > > > > Myriad framework. It shows up as myriad.marathon.mesos and
>>> > > > > > > > > makes it easy to find what node the framework launched the
>>> > > > > > > > > resource manager on.
>>> > > > > > > > >
>>> > > > > > > > > What if we made Myriad mesos-dns aware, and prior to
>>> > > > > > > > > launching the YARN RM, it could register in mesos-dns?
>>> > > > > > > > > This would mean both the IP addresses and the ports (we
>>> > > > > > > > > need to figure out multiple ports in mesos-dns). Then it
>>> > > > > > > > > could write out ports and hostnames in the NM configs by
>>> > > > > > > > > checking mesos-dns for which ports the resource manager is
>>> > > > > > > > > using.
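>>> > > > > > > > > For what it's worth, mesos-dns already publishes ports via
>>> > > > > > > > > SRV records, so part of this may come for free. Something
>>> > > > > > > > > like the following (the task/framework names and the port
>>> > > > > > > > > here are hypothetical):
>>> > > > > > > > >
>>> > > > > > > > >   $ dig +short _myriad._tcp.marathon.mesos SRV
>>> > > > > > > > >   0 0 31005 myriad-s2jkl.marathon.mesos.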
>>> > > > > > > >
>>> > > > > > > >
>>> > > > > > > > > Side question: when a node manager registers with the
>>> > > > > > > > > resource manager, are the ports the NM is running on
>>> > > > > > > > > completely up to the NM? I.e., can I run my NM web server
>>> > > > > > > > > on any port, and YARN just explains that to the RM on
>>> > > > > > > > > registration? Because then we need a mechanism at launch
>>> > > > > > > > > of the NM task to understand which ports Mesos has
>>> > > > > > > > > allocated to the NM and update the yarn-site for that NM
>>> > > > > > > > > before launch.... Perhaps mesos-dns as a requirement isn't
>>> > > > > > > > > needed, but I am trying to walk through options that get
>>> > > > > > > > > us closer to multiple YARN clusters on a Mesos cluster.
>>> > > > > > > > >
>>> > > > > > > > > John
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > --
>>> > > > > > > > > Sent from my iThing
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > --
>>> > > > > > > Sent from my iThing
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>
