This is true. In this setup, though, we wouldn't be using the "random
ports": we'd be assigning the ports the RM will use (the 5) per cluster
(with config changes) ahead of time. Those are what the RM would know as
its ports. At that point, when Marathon spins up an RM, HAProxy would take
the service ports (which would be the same ports the RM "thinks" it is
running on) and forward them to the ports that Mesos has proxied (in the
available ports list). I've done this in Docker, but not with native
Marathon-run processes. I need to look into that more.
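
For the Docker case, I'm picturing a Marathon app definition along these
lines. This is just a sketch: the app id, image name, and resource numbers
are made up, and I'm assuming BRIDGE networking with hostPort 0 so Mesos
picks the host-side ports while the servicePorts stay fixed per cluster:

  {
    "id": "myriad-dev-1",
    "cpus": 2,
    "mem": 4096,
    "instances": 1,
    "container": {
      "type": "DOCKER",
      "docker": {
        "image": "example/yarn-rm",
        "network": "BRIDGE",
        "portMappings": [
          {"containerPort": 15000, "hostPort": 0, "servicePort": 15000, "protocol": "tcp"},
          {"containerPort": 15001, "hostPort": 0, "servicePort": 15001, "protocol": "tcp"},
          {"containerPort": 15002, "hostPort": 0, "servicePort": 15002, "protocol": "tcp"},
          {"containerPort": 15003, "hostPort": 0, "servicePort": 15003, "protocol": "tcp"},
          {"containerPort": 15004, "hostPort": 0, "servicePort": 15004, "protocol": "tcp"}
        ]
      }
    }
  }

The servicePort values are what the haproxy bridge would expose on every
node; the hostPort 0 entries become the Mesos-assigned ports (the 30000s in
the example below).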

One concern I have with HAProxy is long-running TCP connections (I am not
sure if this applies to Yarn/RM). It bit me in one particular use case:
running a Hive Thrift (HiveServer2) service in Docker on the Mesos cluster
behind HAProxy. I found that if I submitted a long query, the query would
be submitted, HAProxy would then see no traffic on the connection for a
while, and it would kill the proxied connection to the backend. This was
annoying, to say the least. Would the same thing happen to the RM's
connections? I really think that if the haproxy-marathon bridge is used,
we'd have to be certain that condition can't occur, even hidden. (I would
hate for something to happen where that condition occurs, but Yarn is able
to "reset" without error, adding a bit of latency to the process, and have
that go unaddressed.)
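
For what it's worth, I believe what bit me was HAProxy's idle client/server
timeouts rather than anything Yarn-specific. If so, something like this in
the generated config should mitigate it; a sketch only, and I have not
tested it with the bridge script:

  defaults
    mode tcp
    timeout connect 5s
    # These idle timeouts are (I believe) what killed my long-running
    # Thrift queries; raise them well above the longest expected idle gap.
    timeout client 1h
    timeout server 1h
    # And/or enable TCP keepalives on both sides so a quiet connection
    # never looks dead to the proxy.
    option clitcpka
    option srvtcpka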

So other than the HAProxy weirdness I saw, that approach could work, and
then mesos-dns is just a nice component for administrators and users. What
do I mean by that?

Well, let's say you have a cluster of node1, node2, node3, and node4.

You assign the 5 Yarn ports (and service ports) for that cluster to be
15000, 15001, 15002, 15003, and 15004.

Myriad spins up the cluster, setting in the RM config (and all NM configs)
the ports based on the 5 above.
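
Concretely, I'd expect the generated yarn-site.xml to pin the five standard
RM addresses (the interfaces Santosh listed below) to our chosen service
ports. A sketch; which port goes to which interface is my own arbitrary
choice:

  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:15000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>${yarn.resourcemanager.hostname}:15001</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${yarn.resourcemanager.hostname}:15002</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>${yarn.resourcemanager.hostname}:15003</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>${yarn.resourcemanager.hostname}:15004</value>
  </property>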

Mesos grabs 5 random ports in its allowed range (e.g., 30000 to 31000).

When Mesos starts the RM process, let's say it starts it on node2.

Node2 now has ports 30000, 30001, 30002, 30003, and 30004 listening and is
forwarding those to 15000, 15001, 15002, 15003, and 15004 on the listening
process. (Note: I know this is doable with Docker-contained processes; can
Marathon do it outside of Docker?)

Now HAProxy's config is updated: on EVERY node, ports 15000-15004 are
listening and are forwarded to node2 on ports 30000-30004.
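
If my memory of the bridge script's output is right, the rendered config on
each node would look roughly like this for the first port (names are
illustrative; the other four ports get identical blocks):

  listen myriad-dev-1_15000
    bind 0.0.0.0:15000
    mode tcp
    option tcplog
    balance leastconn
    # the host-side port Mesos assigned to the RM task on node2
    server myriad-dev-1-0 node2:30000 check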

To your point on "needing" mesos-dns: technically no, we don't need it. We
can tell our NMs to connect to any node on ports 15000-15004, and this will
work. But we may get added latency (rack-to-rack forwarding, extra hops,
etc.).

Instead, if we set the NMs to connect to myriad-dev-1.marathon.mesos, it
could return the IP of THE node the RM is running on. That way we get the
advantage of having the NMs connect to the box with the process. HAProxy
takes the requests and sends them to the Mesos ports (30000-30004), which
Mesos then sends to the process on ports 15000-15004.
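
On the NM side that's then a single property (assuming the standard
${yarn.resourcemanager.hostname} expansion from the sketch above), and all
five RM addresses resolve through mesos-dns to whichever node Marathon
placed the RM on:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>myriad-dev-1.marathon.mesos</value>
  </property>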

So without mesos-dns, you just connect to any node on the service ports and
it "works". But when it comes to self-documentation, connecting to
myriad-dev-1.marathon.mesos seems more descriptive than saying the RM is on
node2.yourdomain, especially when it's not... potential for administrative
confusion.

With mesos-dns, you connect to the descriptive name, and it works. But
then, given my concerns with HAProxy, do we even NEED it? All HAProxy is
doing at that point is opening a port on a node and sending traffic to
another Mesos-approved port, only to land on the same port the process is
listening on. Are we adding complexity?

This is a great discussion as it speaks to some intrinsic challenges that
exist in data center OSes :)




On Thu, May 14, 2015 at 1:50 PM, Santosh Marella <[email protected]>
wrote:

> I might be missing something, but I didn't understand why mesos-dns would
> be required in addition to HAProxy. If we configure RM to bind to random
> ports, but have RM reachable via HAProxy on RM's service ports, won't all
> the clients (such as NMs/HiveServer2, etc.) just use HAProxy to reach the
> RM? If yes, why is mesos-dns needed?
>
> I have very limited knowledge about HAProxy configuration in a mesos
> cluster. I just read through this doc:
> https://docs.mesosphere.com/getting-started/service-discovery/ and what I
> inferred is that an HAProxy instance runs on every slave node, and if an
> NM running on a slave node has to reach the RM, it would simply use an
> RM address that looks like "localhost:99999" (where 99999 is an
> admin-identified RPC service port for the RM).
> Since HAProxy on the NM's localhost listens on 99999, it just forwards the
> traffic to the RM's IP:RandomPort. Am I understanding this correctly?
>
> Thanks,
> Santosh
>
> On Tue, May 12, 2015 at 5:41 AM, John Omernik <[email protected]> wrote:
>
> > The challenge I think is the ports. So we have 5 ports that are needed
> > for a RM; do we predefine those? I think Yuliya is saying yes, we
> > should. An interesting compromise... rather than truly random ports,
> > when we define a Yarn cluster, we have the responsibility to define our
> > 5 "service" ports using the Marathon/HAProxy service ports. (This now
> > requires HAProxy as well as mesos-dns. I'd recommend some work being
> > done on documenting HAProxy for use with the haproxy script; I know
> > that I stumbled a bit trying to get HAProxy set up, but that just may
> > be my own lack of knowledge on the subject.) These ports will have to
> > be available across the cluster, and will map to whichever ports Mesos
> > assigns to the RM.
> >
> > This makes sense to me; a "Yarn Cluster Creation" event on a Mesos
> > cluster is something we want to be flexible, but it's not something
> > that will likely be "self service". I.e., we won't have users just
> > creating Yarn clusters at will. It will likely be something where, when
> > requested, the admin can identify 5 available service ports and lock
> > those into that cluster... that way when the Yarn RM spins up, it has
> > its service ports defined (and thus the node managers always know which
> > ports to connect to). Combined with Mesos-DNS, this could actually work
> > out very well, as the name of the RM can be hard-coded, and the ports
> > will just work no matter which node it spins up on.
> >
> > From an HA perspective, the only advantage at this point of
> > preallocating the failover RM is speed of recovery (and a guarantee of
> > resources being available if failover occurs). Perhaps we could
> > consider this as an option for those who need fast or guaranteed
> > recovery, but not make it a requirement?
> >
> > The service port method will not work, however, for the node manager
> > ports. That said, I "believe" that as Myriad spins up a node manager,
> > it can dynamically allocate the ports and thus report those to the
> > resource manager on registration. Someone may need to help me out on
> > that one, as I am not sure. Also, since the node manager is
> > host-specific, mesos-dns is not required; it can register to the
> > resource manager with whatever ports are allocated and the hostname
> > it's running on. I guess the question here is: when Myriad requests
> > the resources and Mesos allocates the ports, can Myriad, prior to
> > actually starting the node manager, update the configs with the
> > allocated ports? Or is this even needed?
> >
> > This is a great discussion.
> >
> > On Mon, May 11, 2015 at 9:58 PM, yuliya Feldman <[email protected]>
> > wrote:
> >
> > > As far as I understand, in this case Apache YARN RM HA will kick in,
> > > which means all the IDs, hosts, and ports for all RMs will need to be
> > > defined somewhere, and I wonder how they will be defined in this
> > > situation, since those either need to be in yarn-site.xml or passed
> > > using "-D".
> > > In the Mesos-DNS case there is no need to set up RM HA at all, and no
> > > warm standby is needed. Marathon will start the RM somewhere in case
> > > of failure, and clients will rediscover it based on the same hostname.
> > > Am I missing anything?
> > >       From: Adam Bordelon <[email protected]>
> > >  To: [email protected]
> > >  Sent: Monday, May 11, 2015 7:26 PM
> > >  Subject: Re: Recommending or requiring mesos dns?
> > >
> > > I'm a +1 for random ports. You can also use Marathon's servicePort
> > > field to let HAProxy redirect from the servicePort to the actual
> > > hostPort for the service on each node. Mesos-DNS will similarly
> > > direct you to the correct host:port given the appropriate task name.
> > >
> > > Is there a reason we can't just have Marathon launch two RM tasks for
> > > the same YARN cluster? One would be the leader, and the other would
> > > redirect to it until failover. Once one fails over, the other will
> > > start taking traffic, and Marathon will try to launch a new backup RM
> > > when the resources are available. If the YARN RM cannot provide us
> > > this functionality on its own, perhaps we can write a simple wrapper
> > > script for it.
> > >
> > >
> > >
> > > On Fri, May 8, 2015 at 11:57 AM, John Omernik <[email protected]>
> wrote:
> > >
> > > > I would advocate random ports, because there should not be a
> > > > limitation of running only one RM per node. If we want true
> > > > portability, there should be the ability to have the RM for the
> > > > cluster YarnProd run on node1 and also have the RM for the cluster
> > > > YarnDev running on node1 (if it so happens to land this way). That
> > > > way the number of clusters isn't limited by the number of physical
> > > > nodes.
> > > >
> > > > On Fri, May 8, 2015 at 1:33 PM, Santosh Marella <[email protected]>
> > > > wrote:
> > > >
> > > > > RM can store its data either in HDFS or in ZooKeeper. The data
> > > > > store is configurable. There is a config property in YARN
> > > > > (yarn.resourcemanager.recovery.enabled) that tells RM whether it
> > > > > should try to recover the metadata about the previously submitted
> > > > > apps, the containers allocated to them, etc. from the state store.
> > > > >
> > > > > Pre-allocation of a backup RM is a great idea. Thinking about it
> > > > > a bit more, I felt it might be better to have such an option
> > > > > available in Marathon rather than building it in Myriad (and in
> > > > > all frameworks/services that want HA/failover).
> > > > >
> > > > >  Let's say we launch a service X via Marathon that requires some
> > > > > resources (cpus/mem/ports), and we want 1 instance of that service
> > > > > to always be available. Marathon promises restart of the service
> > > > > if it goes down. But, as far as I understand, Marathon can restart
> > > > > the service on another node only if the resources required by
> > > > > service X are available on that node *after* the service goes
> > > > > down. In other words, Marathon doesn't proactively "reserve" these
> > > > > resources on another node as a backup for failover.
> > > > >
> > > > > Again, not all services launched via Marathon require this, but
> > > > > perhaps there should be a config option to specify whether a
> > > > > service desires to have Marathon keep a backup node ready-to-go
> > > > > in the event of failure.
> > > > >
> > > > >
> > > > > > On Thu, May 7, 2015 at 4:12 PM, John Omernik <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > So I may be looking at this wrong, but where is the data for
> > > > > > the RM stored if it does fail over? How will it know to pick up
> > > > > > where it left off? This is just one area where my understanding
> > > > > > is low.
> > > > > >
> > > > > >  That said, what about preallocating a second failover RM
> > > > > > somewhere on the cluster? (I am just tossing an idea out here;
> > > > > > there are probably many reasons not to do this.) Here is how I
> > > > > > could see it happening.
> > > > > >
> > > > > > 1. Myriad starts a RM asking for 5 random available ports.
> > > > > > Mesos replies, starting the RM, and reports to Myriad the 5
> > > > > > ports used for the services you listed below.
> > > > > >
> > > > > > 2. Myriad then checks a config value for the number of "hot
> > > > > > spares"; let's say we specify 1. Myriad then puts in a resource
> > > > > > request to Mesos for the CPU and memory required for the RM, but
> > > > > > specifically asks for the same 5 ports allocated to the first.
> > > > > > Basically it reserves a spot on another node with the same ports
> > > > > > available. It may take a bit, but there should be that
> > > > > > availability. Until this request is met, the Yarn cluster is in
> > > > > > an HA-compromised position.
> > > > > >
> > > > >
> > > > >    This is exactly what I think we should do, but why use random
> > > > > ports instead of standard RM ports? If you have 10 slave nodes in
> > > > > your Mesos cluster, then there are 10 potential spots for RM to be
> > > > > launched on. However, if you choose to launch multiple RMs
> > > > > (multiple YARN clusters), then you can probably launch at most 5
> > > > > (with the remaining 5 nodes available
> > > > >
> > > > > >
> > > > > > 3. At this point perhaps we start another instance of the RM
> > > > > > right away (depends on my first question of where the RM stores
> > > > > > info about jobs/applications), or the framework just holds the
> > > > > > spot, waiting for a lack of heartbeat (failover condition) on
> > > > > > the primary resource manager.
> > > > > >
> > > > > > 4. If we can run the spare with no issues, it's a simple update
> > > > > > of the DNS record and node managers connect to the new RM (and
> > > > > > another RM is preallocated for redundancy). If we can't actually
> > > > > > execute the secondary RM until failover conditions, we can now
> > > > > > execute the new RM, and the ports will be the same.
> > > > > >
> > > > > > This may seem kludgey at first, but done correctly it may
> > > > > > actually limit the length of failover time, as the RM is
> > > > > > preallocated. RMs are not huge from a resource perspective, so
> > > > > > it may be a small cost for those who want failover and multiple
> > > > > > clusters (and thus dynamic ports).
> > > > > >
> > > > > > I will keep thinking this through, and would welcome feedback.
> > > > > >
> > > > > > On Thursday, May 7, 2015, Santosh Marella <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi John,
> > > > > > >
> > > > > > >  Great views about extending mesos-dns for RM's discovery.
> > > > > > > Some thoughts:
> > > > > > >    1. There are 5 primary interfaces RM exposes that are bound
> > > > > > > to standard ports:
> > > > > > >        a. RPC interface for clients that want to submit
> > > > > > > applications to YARN (port 8032).
> > > > > > >        b. RPC interface for NMs to connect back/HB to RM (port
> > > > > > > 8031).
> > > > > > >        c. RPC interface for App Masters to connect back/HB to
> > > > > > > RM (port 8030).
> > > > > > >        d. RPC interface for admin to interact with RM via CLI
> > > > > > > (port 8033).
> > > > > > >        e. Web interface for RM's UI (port 8088).
> > > > > > >    2. When we launch RM using Marathon, it's probably better
> > > > > > > to mention in Marathon's config that RM will use the above
> > > > > > > ports. This is because, if RM listens on random ports (as
> > > > > > > opposed to the standard ports listed above), then when RM
> > > > > > > fails over, the new RM gets ports that might be different from
> > > > > > > the ones used by the old RM. This makes the RM's discovery
> > > > > > > hard, especially post-failover.
> > > > > > >    3. It looks like what you are proposing is a way to update
> > > > > > > mesos-dns as to what ports RM's services are listening on, and
> > > > > > > when RM fails over, these ports would get updated in
> > > > > > > mesos-dns. Is my understanding correct? If yes, one challenge
> > > > > > > I see is that the clients that want to connect to the above
> > > > > > > listed RM interfaces also need to pull the changes to RM's
> > > > > > > port numbers from mesos-dns dynamically. Not sure how that
> > > > > > > might be possible.
> > > > > > >
> > > > > > >  Regarding your question about NM ports:
> > > > > > >  1. NM has the following ports:
> > > > > > >      a. RPC port for app masters to launch containers (this is
> > > > > > > a random port).
> > > > > > >      b. RPC port for the localization service (port 8040).
> > > > > > >      c. Web port for NM's UI (port 8042).
> > > > > > >    2. Ports (a) and (c) are relayed to RM when NM registers
> > > > > > > with RM. Port (b) is passed to a local container executor
> > > > > > > process via command-line args.
> > > > > > >    3. As you rightly reckon, we need a mechanism at launch of
> > > > > > > NM to pass the Mesos-allocated ports to NM for the above
> > > > > > > interfaces. We can try to use the variable expansion mechanism
> > > > > > > Hadoop has
> > > > > > > <http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/conf/Configuration.html>
> > > > > > > to achieve this.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Santosh
> > > > > > >
> > > > > > > On Thu, May 7, 2015 at 3:51 AM, John Omernik <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I've implemented mesos-dns and use Marathon to launch my
> > > > > > > > Myriad framework. It shows up as myriad.marathon.mesos and
> > > > > > > > makes it easy to find what node the framework launched the
> > > > > > > > resource manager on.
> > > > > > > >
> > > > > > > >  What if we made Myriad mesos-dns aware, and prior to
> > > > > > > > launching the Yarn RM, it could register in mesos-dns? This
> > > > > > > > would mean both the IP addresses and the ports (we need to
> > > > > > > > figure out multiple ports in mesos-dns). Then it could write
> > > > > > > > out ports and hostnames in the NM configs by checking
> > > > > > > > mesos-dns for which ports the resource manager is using.
> > > > > > >
> > > > > > >
> > > > > > > > Side question: when a node manager registers with the
> > > > > > > > resource manager, are the ports the NM is running on
> > > > > > > > completely up to the NM? I.e., I can run my NM web server on
> > > > > > > > any port, and Yarn just explains that to the RM on
> > > > > > > > registration? Because then we need a mechanism at launch of
> > > > > > > > the NM task to understand which ports Mesos has allocated to
> > > > > > > > the NM and update the yarn-site for that NM before
> > > > > > > > launch.... Perhaps mesos-dns as a requirement isn't needed,
> > > > > > > > but I am trying to walk through options that get us closer
> > > > > > > > to multiple Yarn clusters on a Mesos cluster.
> > > > > > > >
> > > > > > > > John
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Sent from my iThing
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Sent from my iThing
> > > > > >
