I don't want to be negative; in concept, the idea has merit. That said, I am extremely concerned about performance. If there is an x% performance hit on this, and there is another method that would take more work but not carry that hit, I think we should focus on the latter. I understand there may be "smallish" applications this would work for; however, I see a danger of scale: while it may work at small scale in dev/testing, someone who adopts this approach and then TRIES to scale may be severely disappointed.
On Wed, May 20, 2015 at 8:49 PM, Swapnil Daingade <[email protected]> wrote:

Trying to send the image again, this time as an attachment.

Regards,
Swapnil

On Wed, May 20, 2015 at 5:43 PM, Swapnil Daingade <[email protected]> wrote:

Hi John,

Are you suggesting something like this? In issue 96 we are proposing something that will not require port mapping. Can you take a look and give your thoughts?
https://github.com/mesos/myriad/issues/96

Regards,
Swapnil

On Fri, May 15, 2015 at 6:44 AM, John Omernik <[email protected]> wrote:

This is true. In this setup, though, we wouldn't be using the "random ports". We'd be assigning the ports that will be used by the RM (the 5) per cluster (with config changes) ahead of time. That is what the RM would know as its ports. At that point, when Marathon spins up an RM, HAProxy would take the service ports (which would be the same ports the RM "thinks" it is running on) and forward them to the ports that Mesos has proxied (in the available ports list). I've done this in Docker, but not on native Marathon-run processes. I need to look into that more.

One concern I have with HAProxy is long-running TCP connections (I am not sure if this applies to YARN/RM). In one particular use case, running a Hive Thrift (HiveServer2) service in Docker on the Mesos cluster behind HAProxy, I found that if I submitted a long query, the query would be submitted, HAProxy would not see connections for a while, and it would kill the proxy to the backend. This was annoying, to say the least. Would this occur here? I really think that if the haproxy-marathon-bridge were used, we'd have to be certain that condition wouldn't occur, even hidden. (I would hate for that condition to occur, have YARN "reset" without error (adding a bit of latency to the process), and have it go unaddressed.)

So other than the HAProxy weirdness I saw, that approach could work, and then mesos-dns is just a nice component for administrators and users. What do I mean by that?

Well, let's say you have a cluster of node1, node2, node3, and node4.

You assign the 5 YARN ports (and service ports) for that cluster to be 15000, 15001, 15002, 15003, 15004.

Myriad starts a node manager. It sets in the RM config (and all NM configs) the ports based on the 5 above.

Mesos grabs 5 random ports in its allowed range (default 30000 to 31000).

When Mesos starts the RM process, let's say it starts it on node2.

Node2 now has ports 30000, 30001, 30002, 30003, and 30004 listening and is forwarding those to 15000, 15001, 15002, 15003, and 15004 on the listening process. (Note: I know this is doable with Docker-contained processes; can Marathon do it outside of Docker?)

Now HAProxy's config is updated: on EVERY node, ports 15000-15004 are listening and forwarding to node2 on ports 30000-30004.

To your point on "needing" mesos-dns: technically no, we don't need it. We can tell our NMs to connect to any node on ports 15000-15004, and that will work. But we may get added latency (rack-to-rack forwarding, extra hops).

Instead, if we set the NMs to connect to myriad-dev-1.marathon.mesos, it could return an IP that is THE node it's running on. That way we get the advantage of having the NMs connect to the box with the process. HAProxy takes the requests and sends them to the Mesos ports (30000-30004), which Mesos then sends to the process on ports 15000-15004.

So without mesos-dns, you just connect to any node on the service ports and it "works"; but when it comes to self-documentation, connecting to myriad-dev-1.marathon.mesos seems more descriptive than saying the NM is on node2.yourdomain, especially when it's not. There is potential for administrative confusion.

With mesos-dns, you connect to the descriptive name and it works. But then, given my concerns with HAProxy, do we even NEED it? All HAProxy is doing at that point is opening a port on a node and sending traffic to another Mesos-approved port, only to send it to the same port the process is listening on. Are we adding complexity?

This is a great discussion, as it speaks to some intrinsic challenges that exist in data center OSes :)
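A minimal haproxy.cfg sketch of the mapping described above, assuming the RM landed on node2 with Mesos-assigned port 30000 backing service port 15000 (the other four ports would follow the same pattern). The generous timeouts are one way to guard against the idle-connection kills described for HiveServer2:

    frontend yarn_rm_client_15000
        bind *:15000
        mode tcp
        # YARN RPC connections can sit idle during long operations;
        # a long client timeout avoids HAProxy cutting them off
        timeout client 1h
        default_backend yarn_rm_client_be

    backend yarn_rm_client_be
        mode tcp
        timeout server 1h
        # Mesos-assigned host port on the node actually running the RM
        server rm1 node2:30000 check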
On Thu, May 14, 2015 at 1:50 PM, Santosh Marella <[email protected]> wrote:

I might be missing something, but I didn't understand why mesos-dns would be required in addition to HAProxy. If we configure the RM to bind to random ports, but have the RM reachable via HAProxy on the RM's service ports, won't all the clients (such as NMs, HiveServer2, etc.) just use HAProxy to reach the RM? If yes, why is mesos-dns needed?

I have very limited knowledge about HAProxy configuration in a Mesos cluster. I just read through this doc: https://docs.mesosphere.com/getting-started/service-discovery/ and what I inferred is that an HAProxy instance runs on every slave node, and if an NM running on a slave node has to reach the RM, it would simply use an RM address that looks like "localhost:99999" (where 99999 is an admin-identified RPC service port for the RM). Since HAProxy on the NM's localhost listens on 99999, it just forwards the traffic to the RM's IP:RandomPort. Am I understanding this correctly?

Thanks,
Santosh

On Tue, May 12, 2015 at 5:41 AM, John Omernik <[email protected]> wrote:

The challenge, I think, is the ports. We have 5 ports that are needed for an RM; do we predefine those? I think Yuliya is saying yes, we should. An interesting compromise: rather than truly random ports, when we define a YARN cluster, we take on the responsibility of defining the 5 "service" ports using the Marathon/HAProxy service ports. (This now requires HAProxy as well as mesos-dns. I'd recommend some work be done on documenting HAProxy for use with the haproxy script; I know I stumbled a bit trying to get HAProxy set up, but that may just be my own lack of knowledge on the subject.) These ports will have to be available across the cluster, and will map to whichever ports Mesos assigns to the RM.

This makes sense to me. A "YARN cluster creation" event on a Mesos cluster is something we want to be flexible, but it's not something that will likely be "self service"; i.e., we won't have users just creating YARN clusters at will. It will more likely be something where, when requested, the admin can identify 5 available service ports and lock those into that cluster. That way, when the YARN RM spins up, it has its service ports defined (and thus the node managers always know which ports to connect to). Combined with mesos-dns, this could actually work out very well, as the name of the RM can be hard-coded and the ports will just work no matter which node it spins up on.

From an HA perspective, the only advantage at this point of preallocating the failover RM is speed of recovery (and a guarantee of resources being available if failover occurs). Perhaps we could consider this as an option for those who need fast or guaranteed recovery, but not make it a requirement?

The service port method will not work, however, for the node manager ports. That said, I "believe" that as Myriad spins up a node manager, it can dynamically allocate the ports and report those to the resource manager on registration. Someone may need to help me out on that one, as I am not sure. Also, since the node manager is host-specific, mesos-dns is not required; it can register with the resource manager using whatever ports are allocated and the hostname it's running on. I guess the question here is: when Myriad requests the resources and Mesos allocates the ports, can Myriad, prior to actually starting the node manager, update the configs with the allocated ports? Or is this even needed?

This is a great discussion.
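A rough sketch of the Marathon app definition such a "cluster creation" event might pin down. The app id and launch command are hypothetical; with Marathon's default requirePorts=false behavior, the listed ports act as service ports for the HAProxy bridge, while the task itself receives dynamically assigned host ports in $PORT0 through $PORT4:

    {
      "id": "/myriad-dev-1",
      "cmd": "./bin/launch-yarn-rm.sh",
      "cpus": 2,
      "mem": 4096,
      "instances": 1,
      "ports": [15000, 15001, 15002, 15003, 15004]
    }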
On Mon, May 11, 2015 at 9:58 PM, yuliya Feldman <[email protected]> wrote:

As far as I understand, in this case Apache YARN RM HA will kick in, which means all the ids, hosts, and ports for all RMs will need to be defined somewhere, and I wonder how that would be defined in this situation, since those need to be either in yarn-site.xml or passed with "-D". In the case of mesos-dns usage there is no need to set up RM HA at all, and no warm standby is needed: Marathon will start the RM somewhere in the event of failure, and clients will rediscover it based on the same hostname. Am I missing anything?
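For reference, the static RM HA wiring Yuliya refers to looks roughly like this in yarn-site.xml (hostnames and ZK address are placeholders). Every RM id and host must be pinned ahead of time, which is exactly what is hard to square with Marathon placing the RM on an arbitrary node:

    <property>
      <name>yarn.resourcemanager.ha.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.ha.rm-ids</name>
      <value>rm1,rm2</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm1</name>
      <value>node1.yourdomain</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname.rm2</name>
      <value>node2.yourdomain</value>
    </property>
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>zk1.yourdomain:2181</value>
    </property>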
From: Adam Bordelon <[email protected]>
To: [email protected]
Sent: Monday, May 11, 2015 7:26 PM
Subject: Re: Recommending or requiring mesos dns?

I'm a +1 for random ports. You can also use Marathon's servicePort field to let HAProxy redirect from the servicePort to the actual hostPort for the service on each node. Mesos-DNS will similarly direct you to the correct host:port given the appropriate task name.

Is there a reason we can't just have Marathon launch two RM tasks for the same YARN cluster? One would be the leader, and the other would redirect to it until failover. Once one fails over, the other will start taking traffic, and Marathon will try to launch a new backup RM when the resources are available. If the YARN RM cannot provide us this functionality on its own, perhaps we can write a simple wrapper script for it.

On Fri, May 8, 2015 at 11:57 AM, John Omernik <[email protected]> wrote:

I would advocate random ports because there should not be a limitation of running only one RM per node. If we want true portability, there should be the ability to have the RM for cluster YarnProd running on node1 and also have the RM for cluster YarnDev running on node1 (if it so happens to land that way). That way the number of clusters isn't limited by the number of physical nodes.
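With random ports, discovery has to carry the port as well as the host, which mesos-dns does through SRV records. A hypothetical lookup for the task named earlier in the thread (the record name follows mesos-dns's _task._protocol.framework.domain convention; the output shown is purely illustrative):

    $ dig +short _myriad-dev-1._tcp.marathon.mesos SRV
    0 0 30000 node2.yourdomain.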
On Fri, May 8, 2015 at 1:33 PM, Santosh Marella <[email protected]> wrote:

The RM can store its data either in HDFS or in ZooKeeper; the data store is configurable. There is a config property in YARN (yarn.resourcemanager.recovery.enabled) that tells the RM whether it should try to recover the metadata about the previously submitted apps, the containers allocated to them, etc. from the state store.

Preallocation of a backup RM is a great idea. Thinking about it a bit more, I felt it might be better to have such an option available in Marathon rather than building it into Myriad (and into every framework/service that wants HA/failover).

Let's say we launch a service X via Marathon that requires some resources (cpus/mem/ports) and we want 1 instance of that service to always be available. Marathon promises to restart the service if it goes down. But, as far as I understand, Marathon can restart the service on another node only if the resources required by service X are available on that node *after* the service goes down. In other words, Marathon doesn't proactively "reserve" these resources on another node as a backup for failover.

Again, not all services launched via Marathon require this, but perhaps there should be a config option to specify whether a service wants Marathon to keep a backup node ready to go in the event of failure.

On Thu, May 7, 2015 at 4:12 PM, John Omernik <[email protected]> wrote:

> So I may be looking at this wrong, but where is the data for the RM stored if it does fail over? How will it know to pick up where it left off? This is just one area where my understanding is low.
>
> That said, what about preallocating a second failover RM somewhere on the cluster? (I am just tossing out an idea here; there are probably many reasons not to do this.) Here is how I could see it happening:
>
> 1. Myriad starts an RM asking for 5 random available ports. Mesos replies, starting the RM and reporting to Myriad the 5 ports used for the services you listed below.
>
> 2. Myriad then checks a config value for the number of "hot spares"; let's say we specify 1. Myriad then puts in a resource request to Mesos for the CPU and memory required for the RM, but specifically asks for the same 5 ports allocated to the first. Basically it reserves a spot on another node with the same ports available. It may take a bit, but there should be that availability. Until this request is met, the YARN cluster is in an HA-compromised position.

This is exactly what I think we should do, but why use random ports instead of standard RM ports? If you have 10 slave nodes in your Mesos cluster, then there are 10 potential spots for the RM to be launched on. However, if you choose to launch multiple RMs (multiple YARN clusters), then you can probably launch at most 5 (with the remaining 5 nodes available as backups).

> 3. At this point, perhaps we start another instance of the RM right away (depends on my first question about where the RM stores info about jobs/applications), or the framework just holds the spot, waiting for a lack of heartbeat (failover condition) on the primary resource manager.
>
> 4. If we can run the spare with no issues, it's a simple update of the DNS record and the node managers connect to the new RM (and another RM is preallocated for redundancy). If we can't actually execute the secondary RM until failover conditions, we can now execute the new RM, and the ports will be the same.
>
> This may seem kludgey at first, but done correctly, it may actually limit the length of failover time, as the RM is preallocated. RMs are not huge from a resource perspective, so it may be a small cost for those who want failover and multiple clusters (and thus have dynamic ports).
>
> I will keep thinking this through, and would welcome feedback.
>
> --
> Sent from my iThing
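A sketch of the recovery settings Santosh describes, for a ZooKeeper-backed state store (the ZK addresses are placeholders). This is what would let a respawned RM pick up where the old one left off:

    <property>
      <name>yarn.resourcemanager.recovery.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.resourcemanager.store.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
    <property>
      <name>yarn.resourcemanager.zk-address</name>
      <value>zk1.yourdomain:2181,zk2.yourdomain:2181</value>
    </property>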
On Thursday, May 7, 2015, Santosh Marella <[email protected]> wrote:

Hi John,

Great views about extending mesos-dns for the RM's discovery. Some thoughts:

1. There are 5 primary interfaces the RM exposes that are bound to standard ports:
   a. RPC interface for clients that want to submit applications to YARN (port 8032).
   b. RPC interface for NMs to connect back/heartbeat to the RM (port 8031).
   c. RPC interface for App Masters to connect back/heartbeat to the RM (port 8030).
   d. RPC interface for admins to interact with the RM via the CLI (port 8033).
   e. Web interface for the RM's UI (port 8088).

2. When we launch the RM using Marathon, it's probably better to state in Marathon's config that the RM will use the above ports. This is because, if the RM listens on random ports (as opposed to the standard ports listed above), then when the RM fails over, the new RM gets ports that might be different from the ones used by the old RM. This makes the RM's discovery hard, especially post-failover.

3. It looks like what you are proposing is a way to update mesos-dns as to what ports the RM's services are listening on; and when the RM fails over, these ports would get updated in mesos-dns. Is my understanding correct? If yes, one challenge I see is that the clients that want to connect to the above listed RM interfaces also need to pull the changes to the RM's port numbers from mesos-dns dynamically. I am not sure how that might be possible.

Regarding your question about NM ports:

1. The NM has the following ports:
   a. RPC port for app masters to launch containers (this is a random port).
   b. RPC port for the localization service (port 8040).
   c. Web port for the NM's UI (port 8042).

2. Ports (a) and (c) are relayed to the RM when the NM registers with the RM. Port (b) is passed to a local container executor process via command-line args.

3. As you rightly reckon, we need a mechanism at launch of the NM to pass the Mesos-allocated ports to the NM for the above interfaces. We can try to use the variable expansion mechanism Hadoop has (http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/conf/Configuration.html) to achieve this.

Thanks,
Santosh
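For concreteness, the five interfaces above correspond to these yarn-site.xml properties (defaults 8032, 8031, 8030, 8033, 8088). Remapping them per cluster onto the service ports from the earlier example might look like this, with the mesos-dns name standing in for the RM host:

    <!-- a: client RPC (default 8032) -->
    <property>
      <name>yarn.resourcemanager.address</name>
      <value>myriad-dev-1.marathon.mesos:15000</value>
    </property>
    <!-- b: NM heartbeats (default 8031) -->
    <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>myriad-dev-1.marathon.mesos:15001</value>
    </property>
    <!-- c: App Master RPC (default 8030) -->
    <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>myriad-dev-1.marathon.mesos:15002</value>
    </property>
    <!-- d: admin CLI (default 8033) -->
    <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>myriad-dev-1.marathon.mesos:15003</value>
    </property>
    <!-- e: web UI (default 8088) -->
    <property>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>myriad-dev-1.marathon.mesos:15004</value>
    </property>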
On Thu, May 7, 2015 at 3:51 AM, John Omernik <[email protected]> wrote:

I've implemented mesos-dns and use Marathon to launch my Myriad framework. It shows up as myriad.marathon.mesos and makes it easy to find what node the framework launched the resource manager on.

What if we made Myriad mesos-dns aware, and prior to launching the YARN RM, it could register in mesos-dns? This would mean both the IP addresses and the ports (we need to figure out multiple ports in mesos-dns). Then it could write out ports and hostnames in the NM configs by checking mesos-dns for which ports the resource manager is using.

Side question: when a node manager registers with the resource manager, are the ports the NM is running on completely up to the NM? I.e., can I run my NM web server on any port, with YARN just explaining that to the RM on registration? Because then we need a mechanism at launch of the NM task to understand which ports Mesos has allocated to the NM and update the yarn-site for that NM before launch. Perhaps mesos-dns as a requirement isn't needed, but I am trying to walk through options that get us closer to multiple YARN clusters on a Mesos cluster.

John

--
Sent from my iThing
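A hedged sketch of the variable-expansion mechanism Santosh points to, applied to the NM ports in question; the ${myriad.*} placeholder names are hypothetical. Hadoop's Configuration expands ${...} references from JVM system properties as well as other config properties, so Myriad could template yarn-site.xml once and inject the Mesos-allocated ports per NM at launch time:

    <!-- yarn-site.xml template on each NM host; placeholder names
         are hypothetical and resolved at JVM startup -->
    <property>
      <name>yarn.nodemanager.localizer.address</name>
      <value>0.0.0.0:${myriad.nm.localizer.port}</value>
    </property>
    <property>
      <name>yarn.nodemanager.webapp.address</name>
      <value>0.0.0.0:${myriad.nm.webapp.port}</value>
    </property>

At NM launch, Myriad's executor would then set something like YARN_NODEMANAGER_OPTS="-Dmyriad.nm.localizer.port=31001 -Dmyriad.nm.webapp.port=31002", with the port values taken from the accepted Mesos offer.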
