Abhishek, Thanks for clarifying and updating the SEP. Cheers! Navina
On Wed, May 3, 2017 at 8:20 PM, Jagadish Venkatraman <jagadish1...@gmail.com > wrote: > Navina, > > > >> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both > internal > APIs and have a concrete implementations. > > More specifically, both of these are purely internal implementation classes > (and have nothing to do with any pluggable public API that we expose) > > Best, > Jagadish > > On Wed, May 3, 2017 at 7:34 PM, Abhishek Shivanna <abks...@gmail.com> > wrote: > > > Hey Navina, > > > > Thank you for reviewing the SEP. > > > > > Are you planning on exposing this monitor class as a public api? What > is > > the significance of doing so? > > > > Sorry for the confusion of having implementation details under "public > > interfaces". > > The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both > > internal APIs > > and have a concrete implementations. > > > > > Is "Execution Container ID" the name of the environmental variable? I > > don't > > think environmental variables can contain whitespace?? > > > > Again, confusion that stemmed from my initial draft. I have fixed the SEP > > with the actual name in the implementation. > > > > > I think the first sentence corresponds to your design. The second one > is > > more of an implementation detail. You may want to split it up or just > > discard one of them. I got confused reading them together because one > talks > > about adding to container and the other about the ContainerRunner. > > > > Fixed the SEP to make it more clear. > > > > Thanks, > > Abhishek > > > > > > On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) < > nav...@apache.org> > > wrote: > > > > > Hi Abhishek, > > > I checked your latest proposal in SEP and it looks good to me. > > > > > > QQ: > > > > A new ContainerHeartbeatMonitor class that accepts a > > > ContainerHeartbeatClient (which has the business logic to make > heartbeat > > > checks on the JC endpoint) and a callback. > > > > > > Are you planning on exposing this monitor class as a public api? What > is > > > the significance of doing so? > > > > > > > set an environment variable with the "Execution Container ID" during > > > container launch. This can be read from the container to make requests > to > > > the above endpoint. > > > > > > Is "Execution Container ID" the name of the environmental variable? I > > don't > > > think environmental variables can contain whitespace?? > > > > > > > On the container side we start a new thread that periodically polls > > this > > > endpoint described above to check if the container is valid. If its > not, > > we > > > shutdown the run loop and raise an error (so that the exit code is non > 0 > > so > > > that YARN reschedules the container) > > > The plan is to setup a monitor in the LocalContainerRunner class that > > > schedules a thread to check the above endpoint at regular intervals. On > > > failure the thread modifies state on the LocalContainerRunner to denote > > > that there was an error. This state is checked during exit in the > > > LocalContainerRunner to exit with a non-zero code. > > > > > > I think the first sentence corresponds to your design. The second one > is > > > more of an implementation detail. You may want to split it up or just > > > discard one of them. I got confused reading them together because one > > talks > > > about adding to container and the other about the ContainerRunner. > > > > > > Design looks pretty elegant and easily portable. > > > > > > Thanks! > > > Navina > > > > > > > > > On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <abks...@gmail.com> > > > wrote: > > > > > > > Hey Jagadish, > > > > > > > > Thank you for taking the time to review the design. > > > > I agree with moving the heartbeat into the the LocalContainerRunner > > > instead > > > > of fitting it into the SamzaContainer. I will update the SEP with the > > new > > > > design changes. > > > > Also agree with the changes to the configuration and choosing > suitable > > > > defaults should be good enough. > > > > > > > > Thanks, > > > > Abhishek > > > > > > > > > > > > > > > > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman < > > > > jagadish1...@gmail.com> wrote: > > > > > > > > > Hi Abhishek, > > > > > > > > > > Heartbeat between the AM and container has been a long awaited > Samza > > > > > feature. It will go a long way in ensuring our reliability! +1 for > > this > > > > > SEP. > > > > > > > > > > *High level comments:* > > > > > > > > > > Currently, the only use-case for the heartbeat mechanism seems to > be > > > when > > > > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat > > > logic > > > > > into the *LocalContainerRunner* instead of baking it into the > > > > > *SamzaContainer* class. Long term, we can re-visit this when we > have > > a > > > > > pluggable liveness detection mechanism. > > > > > > > > > > I'm thinking of a flow like this: > > > > > > > > > > There is a separate component (or a thread) inside > > LocalContainerRunner > > > > > that periodically polls the coordinator, and determines if it > should > > > > > continue running. If the coordinator determines that the container > > > should > > > > > not run, the *LocalContainerRunner* cleanly shuts-down the > container > > > and > > > > > the process exits with a non-zero exit status. > > > > > > > > > > The following nice properties fall out: > > > > > > > > > > - We can remove the proposed config *job.container.validator. > > > enabled. > > > > * > > > > > - We can also remove the proposed *Killable* interface since > > > > > *SamzaContainer* (and runLoops) don't have to implement > *Killable > > * > > > > > anymore. The life-cycle is managed by the *LocalContainerRunner* > > > that > > > > > started it. > > > > > > > > > > *On the proposed public interfaces:* > > > > > > > > > > *job.container.validator.enabled: *I am not in favor of adding > this > > > as > > > > a > > > > > new public config. IIUC, When running Samza jobs on Yarn, we always > > > want > > > > > the validator/heartbeats to be enabled. OTOH, when running Samza > jobs > > > in > > > > > standalone mode, we currently do not have a pluggable mechanism for > > > > > heartbeat. > > > > > > > > > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It > > > does > > > > > seem that we can pick a sensible default, and be done with it > > (instead > > > of > > > > > adding a new config)? Is there a reason this needs to be > > configurable? > > > > > > > > > > *On proposed Killable interface: * > > > > > > > > > > Not entirely sure we need this new "*Killable"* interface (esp. > given > > > > that > > > > > there's currently only one implementation - *SamzaContainer*). > > > > > > > > > > - The *LocalContainerRunner* can instead directly invoke > shut-down > > > on > > > > > the *SamzaContainer* when its heart-beat expires. The extra > level > > of > > > > > indirection (making *SamzaContainer* to implement *Killable*) is > > > > > probably unnecessary IMHO. > > > > > > > > > > > > > > > - Since, the *LocalContainerRunner* invokes *start/run* on the > > > > > *SamzaContainer*, it seems simpler also have it invoke > *shutdown* > > on > > > > the > > > > > *SamzaContainer. * > > > > > > > > > > *Minor Comments:* > > > > > > > > > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is > to > > > get > > > > > requests from the Samza container periodically and respond back > > weather > > > > the > > > > > container is in the Job Coordinator's current list of valid > > containers. > > > > > > > > > > Wondering if it'd be slightly cleaner to rename this to > > > > */heartBeatRequest* > > > > > and return a *heartBeatResponse* as *CONTINUE, DIE*. The name > > > > > *isContainerValid > > > > > * and the definition of validity does seem slightly broad? > > > > > > > > > > Thanks again for taking the time to draft the SEP, and volunteering > > to > > > > > implement this. Nice work! > > > > > > > > > > Best, > > > > > Jagadish > > > > > > > > > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna < > > abks...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Everyone, > > > > > > > > > > > > In order to fix the issue of orphaned/leaky containers seen when > > the > > > > > > YARN Node Manager crashes, I have created a SEP discussing the > > design > > > > for > > > > > > implementing a heartbeat between the containers and the job > > > > coordinator: > > > > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP- > > > > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+ > > > > > > all+running+containers > > > > > > > > > > > > Please take a look and provide feedback. I would also really > > > appreciate > > > > > > help in designing a way to propagate the error up from > > SamzaContainer > > > > in > > > > > > order to exit the container with a non-zero exit code. > > > > > > > > > > > > Thanks, > > > > > > Abhishek > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Jagadish V, > > > > > Graduate Student, > > > > > Department of Computer Science, > > > > > Stanford University > > > > > > > > > > > > > > > > > > -- > Jagadish V, > Graduate Student, > Department of Computer Science, > Stanford University >