You are right. I had overlooked the state kept in the Invoker's memory in the ContainerProxy instance via the WarmedData/PrewarmedData instances.
It might need a broader rethinking of the architecture if we want to be
able to salvage the warmed/prewarmed containers of a failed-and-replaced
ContainerPool instance. Perhaps also an argument for continuing to split
the workload across multiple ContainerPool instances even when we are
using an underlying cluster-wide scheduler to actually allocate execution
resources. Reduce the blast zone when a ContainerPool instance fails...

--dave

Tyson Norris <[email protected]> wrote on 04/03/2018 12:00:09 PM:

> From: Tyson Norris <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 04/03/2018 12:00 PM
> Subject: Re: Invoker HA on Mesos
>
> One problem with this (delegating to ContainerFactory to share
> prewarm/warm containers with other cluster nodes) is that
> ContainerFactory is currently ignorant of container state, and
> making use of the shared containers requires sharing at least
> some of their state (besides paused/running state). Specifically:
> - creating a prewarm, the kind needs to be shared
> - pausing a warm, the action needs to be shared
>
> To handle this, the ContainerFactory.createContainer(),
> Container.suspend() and Container.resume() would have to change to
> propagate this state.
>
> This seems slightly awkward to me, so I want to put it out for
> feedback. WDYT?
>
>
> On Mar 30, 2018, at 2:31 PM, David P Grove <[email protected]> wrote:
>
> +1. I like this design.
>
> --dave
>
> Tyson Norris <[email protected]> wrote on 03/30/2018 01:37:43 PM:
>
> From: Tyson Norris <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 03/30/2018 01:37 PM
> Subject: Re: Invoker HA on Mesos
>
> Hooking into pause/unpause/destroy of containers seems plausible,
> instead of hooking into the Maps in ContainerPool.
>
> So in the existing PR, the ContainerPool uses an alternate impl for
> Map to store freePool and prewarmPool, and that alternate impl
> initiates the attach to existing containers when it becomes active.
>
> The ContainerPool could instead potentially delegate to the
> ContainerFactory, e.g. a
> ContainerFactory.reviveContainers(childFactory) => (freePool,
> prewarmPool). We will still need a way to trigger this on demand
> (e.g. when the standby pool becomes active, in our case, but I think
> that is a minor detail).
>
> I can try it out; I will be out next week, but if you test any of
> this in the meantime, let me know.
>
> Thanks
> Tyson
>
>
> On Mar 30, 2018, at 9:58 AM, David P Grove <[email protected]> wrote:
>
> Tyson Norris <[email protected]> wrote on 03/27/2018 06:25:59 PM:
>
> Do you have an example of the labels working? I guess the labels are
> changed over time through the lifecycle of the container?
>
>
> Apologies for brutally chopping the email chain; my mail client made a
> horrible hash of it.
>
> Right now, all we are doing with Kube labels is to label each action
> container with its owning invoker on startup. This lets us delete
> orphaned containers if the invoker crashes and needs to be restarted.
> The labeling happens at [1] and the removal of orphans using the
> labels at [2].
>
> I think the Kube-native version of part of what you are doing with the
> DistributedData for Mesos would be to add and remove additional labels
> to give us the option of attaching a new invoker instance to orphaned
> containers instead of just destroying them. Interacting with the
> Kubernetes API server to do a labeling operation takes around 10ms, so
> we couldn't do this on a truly hot path. But we could probably afford
> to update container labels in parallel with pause/unpause operations,
> which could enable re-attachment to any paused containers.
>
> --dave
>
> [1] https://github.com/apache/incubator-openwhisk/blob/0b20df0f725a671f8e51c9e8793116476fd22f76/core/invoker/src/main/scala/whisk/core/containerpool/kubernetes/KubernetesContainerFactory.scala#L81
> [2] https://github.com/apache/incubator-openwhisk/blob/0b20df0f725a671f8e51c9e8793116476fd22f76/core/invoker/src/main/scala/whisk/core/containerpool/kubernetes/KubernetesContainerFactory.scala#L57
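
For reference, a minimal sketch of the label-based re-attachment idea
discussed in this thread, in Scala since the invoker is written in Scala.
Everything in it is hypothetical: the label keys, the PrewarmState,
WarmState and OrphanContainer types, and the reviveContainers signature
are illustrative stand-ins, not the OpenWhisk or Kubernetes APIs. It only
shows that if the kind (for a prewarm) or the action (for a paused warm
container) is written to labels, a replacement pool could rebuild its
freePool and prewarmPool from the labels alone.

```scala
// Hypothetical, simplified stand-ins for the state that would have to be shared.
sealed trait SharedState
final case class PrewarmState(kind: String) extends SharedState
final case class WarmState(action: String) extends SharedState

// A container as a cluster scheduler might report it: an id plus string labels.
final case class OrphanContainer(id: String, labels: Map[String, String])

object Revive {
  // Label keys are assumptions made up for this sketch.
  val StateLabel = "pool-state"
  val KindLabel = "kind"
  val ActionLabel = "action"

  // Labels to write when a container is created as a prewarm or paused as a warm.
  def labelsFor(state: SharedState): Map[String, String] = state match {
    case PrewarmState(kind) => Map(StateLabel -> "prewarm", KindLabel -> kind)
    case WarmState(action)  => Map(StateLabel -> "warm", ActionLabel -> action)
  }

  // Recover the shared state from labels; None if labels are missing or unknown.
  def stateOf(c: OrphanContainer): Option[SharedState] =
    c.labels.get(StateLabel) match {
      case Some("prewarm") => c.labels.get(KindLabel).map(PrewarmState(_))
      case Some("warm")    => c.labels.get(ActionLabel).map(WarmState(_))
      case _               => None
    }

  // Split orphans into (freePool, prewarmPool, ids that cannot be re-attached).
  def reviveContainers(orphans: Seq[OrphanContainer])
      : (Map[String, WarmState], Map[String, PrewarmState], Vector[String]) = {
    val empty =
      (Map.empty[String, WarmState], Map.empty[String, PrewarmState], Vector.empty[String])
    orphans.foldLeft(empty) { case ((free, prewarm, lost), c) =>
      stateOf(c) match {
        case Some(w: WarmState)    => (free + (c.id -> w), prewarm, lost)
        case Some(p: PrewarmState) => (free, prewarm + (c.id -> p), lost)
        case None                  => (free, prewarm, lost :+ c.id)
      }
    }
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val orphans = Seq(
      OrphanContainer("c1", Revive.labelsFor(PrewarmState("nodejs:6"))),
      OrphanContainer("c2", Revive.labelsFor(WarmState("guest/hello"))),
      OrphanContainer("c3", Map("invoker" -> "invoker0")) // owner label only
    )
    val (freePool, prewarmPool, lost) = Revive.reviveContainers(orphans)
    println(s"freePool=$freePool prewarmPool=$prewarmPool destroy=$lost")
  }
}
```

Containers that carry only an owner label fall into the third bucket and
would still have to be destroyed, as happens today. Given the roughly 10ms
cost of a labeling call mentioned above, the labelsFor write would
presumably be issued alongside pause/unpause rather than on the hot path.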
