Hi John, are you suggesting something like this?
In issue 96 we are proposing something that will not require port mapping. Can you take a look and give your thoughts? https://github.com/mesos/myriad/issues/96

Regards,
Swapnil

On Fri, May 15, 2015 at 6:44 AM, John Omernik <[email protected]> wrote:

> This is true. In this setup, though, we wouldn't be using the "random ports". We'd be assigning the ports that will be used by the RM (the 5) per cluster (with config changes) ahead of time. That is what the RM would know as its ports. At this point, when Marathon spins up an RM, HAProxy would take the service ports (which would be the same ports the RM "thinks" it is running on) and forward them to the ports that Mesos has proxied (in the available ports list). I've done this in Docker, but not on native Marathon-run processes. I need to look into that more.

> One concern I have with HAProxy is long-running TCP connections (I am not sure if this applies to YARN/RM). Basically, in one particular use case, running a Hive Thrift (hiveserver2) service in Docker on the Mesos cluster with HAProxy, I found that if I submitted a long query, the query would be submitted, HAProxy would not see connections for a while, and it would kill the proxy to the backend. This was annoying to say the least. Would this occur with HAProxy? I really think that if the haproxy-marathon bridge were used, we'd have to be certain that condition wouldn't occur, even hidden. (I would hate for that condition to occur, have YARN "reset" without error while adding a bit of latency to the process, and have that go unaddressed.)

> So other than the HAProxy weirdness I saw, that approach could work, and then mesos-dns is just a nice component for administrators and users. What do I mean by that?

> Well, let's say you have a cluster of node1, node2, node3, and node4.

> You assign the 5 YARN ports (and service ports) for that cluster to be 15000, 15001, 15002, 15003, 15004.

> Myriad starts a node manager. It sets the ports in the RM config (and all NM configs) based on the 5 above.

> Mesos grabs 5 random ports in its allowed range (default 30000 to 31000).

> When Mesos starts the RM process, let's say it starts it on node2.

> Node2 now has ports 30000, 30001, 30002, 30003, and 30004 listening and is forwarding those to 15000, 15001, 15002, 15003, and 15004 on the listening process. (Note: I know this is doable with Docker-contained processes; can Marathon do it outside of Docker?)

> Now HAProxy's config is updated. On EVERY node, the ports 15000-15004 are listening and are forwarding to node2 on ports 30000-30004.

> To your point on "needing" mesos-dns: technically no, we don't need it. We can tell our NMs to connect to any node on ports 15000-15004. This will work. But we may get added latency (rack-to-rack forwarding, extra hops, etc.).

> Instead, if we set the NMs to connect to myriad-dev-1.marathon.mesos, it could return an IP that is THE node it's running on. That way we get the advantage of having the NMs connect to the box with the process. HAProxy takes the requests and sends them to the Mesos ports (30000-30004), which Mesos then sends to the process on ports 15000-15004.

> So without mesos-dns: you just connect to any node on the service ports and it "works", but when it comes to self-documentation, connecting to myriad-dev-1.marathon.mesos seems more descriptive than saying the RM is on node2.yourdomain.
> Especially when it's not... there is potential for administrative confusion.

> With mesos-dns, you connect to the descriptive name, and it works. But then, given my concerns with HAProxy, do we even NEED it? All HAProxy is doing at that point is opening a port on a node and sending to another Mesos-approved port, only to send it to the same port the process is listening on. Are we adding complexity?

> This is a great discussion, as it speaks to some intrinsic challenges that exist in data center OSes :)

> On Thu, May 14, 2015 at 1:50 PM, Santosh Marella <[email protected]> wrote:

> > I might be missing something, but I didn't understand why mesos-dns would be required in addition to HAProxy. If we configure RM to bind to random ports, but have RM reachable via HAProxy on RM's service ports, won't all the clients (such as NMs/HiveServer2 etc.) just use HAProxy to reach the RM? If yes, why is mesos-dns needed?

> > I have very limited knowledge about HAProxy configuration in a Mesos cluster. I just read through this doc: https://docs.mesosphere.com/getting-started/service-discovery/ and what I inferred is that a HAProxy instance runs on every slave node, and if an NM running on a slave node has to reach the RM, it would simply use an RM address that looks like "localhost:99999" (where 99999 is an admin-identified RPC service port for the RM). Since HAProxy on the NM's localhost listens on 99999, it just forwards the traffic to the RM's IP:RandomPort. Am I understanding this correctly?

> > Thanks,
> > Santosh

> > On Tue, May 12, 2015 at 5:41 AM, John Omernik <[email protected]> wrote:

> > > The challenge I think is the ports. So we have 5 ports that are needed for an RM; do we predefine those? I think Yuliya is saying yes, we should. An interesting compromise... rather than truly random ports, when we define a YARN cluster, we have the responsibility to define our 5 "service" ports using the Marathon/HAProxy service ports. (This now requires HAProxy as well as mesos-dns. I'd recommend some work being done on documenting HAProxy for use with the haproxy script; I know that I stumbled a bit trying to get HAProxy set up, but that just may be my own lack of knowledge on the subject.) These ports will have to be available across the cluster, and will map to whichever ports Mesos assigns to the RM.

> > > This makes sense to me. A "YARN cluster creation" event on a Mesos cluster is something we want to be flexible, but it's not something that will likely be "self service". I.e., we won't have users just creating YARN clusters at will. It will likely be something that, when requested, the admin can identify 5 available service ports and lock those into that cluster... that way, when the YARN RM spins up, it has its service ports defined (and thus the node managers always know which ports to connect to). Combined with mesos-dns, this could actually work out very well, as the name of the RM can be hard coded, and the ports will just work no matter which node it spins up on.

> > > From an HA perspective, the only advantage at this point of preallocating the failover RM is speed of recovery (and a guarantee of resources being available if failover occurs). Perhaps we could consider this as an option for those who need fast or guaranteed recovery, but not make it a requirement?
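For concreteness, the per-cluster yarn-site.xml under the service-port scheme described above might look roughly like the sketch below. The property names are the standard YARN RM address settings; the mesos-dns hostname and the 15000-15004 service ports are taken from the example in this thread, and the assignment of each port to each interface is purely illustrative, not a tested configuration.

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>myriad-dev-1.marathon.mesos</value>  <!-- mesos-dns name from the example above -->
    </property>
    <property>
      <name>yarn.resourcemanager.address</name>  <!-- client RPC, normally 8032 -->
      <value>${yarn.resourcemanager.hostname}:15000</value>
    </property>
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>  <!-- AM RPC, normally 8030 -->
      <value>${yarn.resourcemanager.hostname}:15001</value>
    </property>
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>  <!-- NM RPC, normally 8031 -->
      <value>${yarn.resourcemanager.hostname}:15002</value>
    </property>
    <property>
      <name>yarn.resourcemanager.admin.address</name>  <!-- admin CLI, normally 8033 -->
      <value>${yarn.resourcemanager.hostname}:15003</value>
    </property>
    <property>
      <name>yarn.resourcemanager.webapp.address</name>  <!-- RM web UI, normally 8088 -->
      <value>${yarn.resourcemanager.hostname}:15004</value>
    </property>

With something like this in place, HAProxy on every node would listen on 15000-15004 and forward to the Mesos-assigned host ports (30000-30004 in the example) on whichever node is actually running the RM.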
> > > The service port method will not work, however, for the node manager ports. That said, I "believe" that as Myriad spins up a node manager, it can dynamically allocate the ports and thus report those to the resource manager on registration. Someone may need to help me out on that one, as I am not sure. Also, since the node manager is host-specific, mesos-dns is not required; it can register to the resource manager with whatever ports are allocated and the hostname it's running on. I guess the question here is: when Myriad requests the resources and Mesos allocates the ports, can Myriad, prior to actually starting the node manager, update the configs with the allocated ports? Or is this even needed?

> > > This is a great discussion.

> > > On Mon, May 11, 2015 at 9:58 PM, yuliya Feldman <[email protected]> wrote:

> > > > As far as I understand, in this case Apache YARN RM HA will kick in, which means all the ids, hosts, and ports for all RMs will need to be defined somewhere, and I wonder how that will be defined in this situation, since those either need to be in yarn-site.xml or passed using "-D". In the case of Mesos-DNS usage there is no need to set up RM HA at all, and no warm standby is needed. Marathon will start RM somewhere in case of failure and clients will rediscover it based on the same hostname. Am I missing anything?

> > > > From: Adam Bordelon <[email protected]>
> > > > To: [email protected]
> > > > Sent: Monday, May 11, 2015 7:26 PM
> > > > Subject: Re: Recommending or requiring mesos dns?

> > > > I'm a +1 for random ports. You can also use Marathon's servicePort field to let HAProxy redirect from the servicePort to the actual hostPort for the service on each node. Mesos-DNS will similarly direct you to the correct host:port given the appropriate task name.

> > > > Is there a reason we can't just have Marathon launch two RM tasks for the same YARN cluster? One would be the leader, and the other would redirect to it until failover. Once one fails over, the other will start taking traffic, and Marathon will try to launch a new backup RM when the resources are available. If the YARN RM cannot provide us this functionality on its own, perhaps we can write a simple wrapper script for it.

> > > > On Fri, May 8, 2015 at 11:57 AM, John Omernik <[email protected]> wrote:

> > > > > I would advocate random ports because there should not be a limitation of running only one RM per node. If we want true portability, there should be the ability to have the RM for the cluster YarnProd run on node1 and also have the RM for the cluster YarnDev running on node1 (if it so happens to land that way). That way the number of clusters isn't limited by the number of physical nodes.

> > > > > On Fri, May 8, 2015 at 1:33 PM, Santosh Marella <[email protected]> wrote:

> > > > > > RM can store its data either in HDFS or in ZooKeeper. The data store is configurable. There is a config property in YARN (yarn.resourcemanager.recovery.enabled) that tells RM whether it should try to recover the metadata about the previously submitted apps, the containers allocated to them, etc., from the state store.
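As a rough illustration of the configurable state store mentioned here, the ZooKeeper-backed variant is typically wired up in yarn-site.xml with properties along these lines; the ZooKeeper quorum is a placeholder, and an HDFS-backed store would use FileSystemRMStateStore and yarn.resourcemanager.fs.state-store.uri instead.

    <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>  <!-- placeholder quorum -->
    </property>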
> > > > > > Pre-allocation of a backup RM is a great idea. Thinking about it a bit more, I felt it might be better to have such an option available in Marathon rather than building it in Myriad (and in all frameworks/services that want HA/failover).

> > > > > > Let's say we launch a service X via Marathon that requires some resources (cpus/mem/ports) and we want 1 instance of that service to always be available. Marathon promises restart of the service if it goes down. But, as far as I understand, Marathon can restart the service on another node only if the resources required by service X are available on that node *after* the service goes down. In other words, Marathon doesn't proactively "reserve" these resources on another node as a backup for failover.

> > > > > > Again, not all services launched via Marathon require this, but perhaps there should be a config option to specify whether a service desires to have Marathon keep a backup node ready-to-go in the event of failure.

> > > > > > On Thu, May 7, 2015 at 4:12 PM, John Omernik <[email protected]> wrote:

> > > > > > > So I may be looking at this wrong, but where is the data for the RM stored if it does fail over? How will it know to pick up where it left off? This is just one area where I am low in understanding.

> > > > > > > That said, what about pre-allocating a second failover RM somewhere on the cluster? (I am just tossing out an idea here; there are probably many reasons not to do this.) But here is how I could see it happening.

> > > > > > > 1. Myriad starts an RM asking for 5 random available ports. Mesos replies, starting the RM, and reports to Myriad the 5 ports used for the services you listed below.

> > > > > > > 2. Myriad then checks a config value for the number of "hot spares"; let's say we specify 1. Myriad then puts in a resource request to Mesos for the CPU and memory required for the RM, but specifically asks for the same 5 ports allocated to the first. Basically it reserves a spot on another node with the same ports available. It may take a bit, but there should be that availability. Until this request is met, the YARN cluster is in an HA-compromised position.

> > > > > > This is exactly what I think we should do, but why use random ports instead of standard RM ports? If you have 10 slave nodes in your Mesos cluster, then there are 10 potential spots for RM to be launched on. However, if you choose to launch multiple RMs (multiple YARN clusters), then you can probably launch at most 5 (with the remaining 5 nodes available
> > > > > > > 3. At this point, perhaps we start another instance of the RM right away (depends on my first question about where the RM stores info about jobs/applications), or the framework just holds the spot, waiting for a lack of heartbeat (failover condition) on the primary resource manager.

> > > > > > > 4. If we can run the spare with no issues, it's a simple update of the DNS record and node managers connect to the new RM (and another RM is preallocated for redundancy). If we can't actually execute the secondary RM until failover conditions occur, we can now execute the new RM, and the ports will be the same.

> > > > > > > This may seem kludgey at first, but done correctly, it may actually limit the length of failover time as the RM is preallocated. RMs are not huge from a resource perspective, so it may be a small cost for those who want failover and multiple clusters (and thus dynamic ports).

> > > > > > > I will keep thinking this through, and would welcome feedback.

> > > > > > > On Thursday, May 7, 2015, Santosh Marella <[email protected]> wrote:

> > > > > > > > Hi John,

> > > > > > > > Great views about extending mesos-dns for RM's discovery. Some thoughts:

> > > > > > > > 1. There are 5 primary interfaces RM exposes that are bound to standard ports:
> > > > > > > >    a. RPC interface for clients that want to submit applications to YARN (port 8032).
> > > > > > > >    b. RPC interface for NMs to connect back/HB to RM (port 8031).
> > > > > > > >    c. RPC interface for App Masters to connect back/HB to RM (port 8030).
> > > > > > > >    d. RPC interface for admins to interact with RM via CLI (port 8033).
> > > > > > > >    e. Web interface for RM's UI (port 8088).

> > > > > > > > 2. When we launch RM using Marathon, it's probably better to mention in Marathon's config that RM will use the above ports. This is because, if RM listens on random ports (as opposed to the above listed standard ports), then when RM fails over, the new RM gets ports that might be different from the ones used by the old RM. This makes the RM's discovery hard, especially post failover.

> > > > > > > > 3. It looks like what you are proposing is a way to update mesos-dns as to what ports RM's services are listening on. And when RM fails over, these ports would get updated in mesos-dns. Is my understanding correct? If yes, one challenge I see is that the clients that want to connect to the above listed RM interfaces also need to pull the changes to RM's port numbers from mesos-dns dynamically. Not sure how that might be possible.
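For reference, yuliya's earlier point about having to define ids, hosts, and ports for all RMs refers to the standard YARN RM HA settings, which look roughly like the sketch below in yarn-site.xml (the cluster id and hostnames are placeholders). This is the configuration that the mesos-dns/Marathon-restart approach discussed in this thread would avoid.

    <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.cluster-id</name>
      <value>yarn-dev</value>  <!-- placeholder cluster id -->
    </property>
    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>node1.yourdomain</value>  <!-- placeholder host -->
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>node2.yourdomain</value>  <!-- placeholder host -->
    </property>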
> > > > > > > > Regarding your question about NM ports:

> > > > > > > > 1. NM has the following ports:
> > > > > > > >    a. RPC port for app masters to launch containers (this is a random port).
> > > > > > > >    b. RPC port for the localization service (port 8040).
> > > > > > > >    c. Web port for NM's UI (port 8042).

> > > > > > > > 2. Ports (a) and (c) are relayed to RM when NM registers with RM. Port (b) is passed to a local container executor process via command line args.

> > > > > > > > 3. As you rightly reckon, we need a mechanism at launch of NM to pass the Mesos-allocated ports to NM for the above interfaces. We can try to use the variable expansion mechanism Hadoop has (http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/conf/Configuration.html) to achieve this.

> > > > > > > > Thanks,
> > > > > > > > Santosh

> > > > > > > > On Thu, May 7, 2015 at 3:51 AM, John Omernik <[email protected]> wrote:

> > > > > > > > > I've implemented mesos-dns and use Marathon to launch my Myriad framework. It shows up as myriad.marathon.mesos and makes it easy to find which node the framework launched the resource manager on.

> > > > > > > > > What if we made Myriad mesos-dns aware, and prior to launching the YARN RM, it could register in mesos-dns? This would mean both the IP addresses and the ports (we need to figure out multiple ports in mesos-dns). Then it could write out ports and hostnames in the NM configs by checking mesos-dns for which ports the resource manager is using.

> > > > > > > > > Side question: when a node manager registers with the resource manager, are the ports the NM is running on completely up to the NM? I.e., I can run my NM web server on any port, and YARN just explains that to the RM on registration? Because then we need a mechanism at launch of the NM task to understand which ports Mesos has allocated to the NM and update the yarn-site for that NM before launch.... Perhaps mesos-dns as a requirement isn't needed, but I am trying to walk through options that get us closer to multiple YARN clusters on a Mesos cluster.
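As a sketch of the variable-expansion idea for the NM ports above: the property names below are the standard NM settings for the three interfaces listed, while the ${nm.*.port} variables are hypothetical placeholders that whatever launches the NM (for example, a Myriad executor) would have to supply, e.g. as Java system properties via -D, using the ports Mesos allocated to the task.

    <!-- Sketch only: ${nm.rpc.port}, ${nm.localizer.port} and ${nm.webapp.port} are
         hypothetical properties the NM launcher would set (e.g. -Dnm.webapp.port=31002)
         from the Mesos-allocated ports. -->
    <property>
      <name>yarn.nodemanager.address</name>  <!-- container-management RPC; random by default -->
      <value>0.0.0.0:${nm.rpc.port}</value>
    </property>
    <property>
      <name>yarn.nodemanager.localizer.address</name>  <!-- localization service; normally 8040 -->
      <value>0.0.0.0:${nm.localizer.port}</value>
    </property>
    <property>
      <name>yarn.nodemanager.webapp.address</name>  <!-- NM web UI; normally 8042 -->
      <value>0.0.0.0:${nm.webapp.port}</value>
    </property>

Hadoop's Configuration expands ${...} against other configuration properties and Java system properties, so this only works if those values are injected before the NM reads its configuration.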
> > > > > > > > > John
