On 21/09/16 01:43 AM, Zhai, Edwin wrote:
> All,
>
> I'd like to make some clarifications on the event-alarm timeout design,
> as many of you have some misunderstandings here. Pls. correct me if I've
> made any mistakes.
>
> I realized that there are 2 different things, but we sometimes mix them:
> 1. event-timeout-alarm
> This is a new type of alarm that brackets *.start and *.end events and
> fires when it receives *.start but no *.end within the timeout. This new
> alarm handles one type of event/action, e.g. create one alarm for
> instance creation, then all instances created in the future will be
> handled by this alarm. This is not for real time, so it's acceptable
> that the user learns of an instance creation failure after 5 mins.
>
> This new type of alarm can be implemented by one worker that checks the
> DB periodically to do the statistics work. That is, the new evaluator
> works in 'polling' mode, something like the threshold alarm evaluator.
>
> One BP is @
> https://review.openstack.org/#/c/199005/

we should probably disregard this bp since it was assumed you guys talked over it. i'm abandoning it as i think we just forgot about it.

>
> 2. event-alarm timeout
> This is a new feature for the _existing_ event-alarm evaluator. An alarm
> becomes 'UNALARM' when the desired event is not received within the
> timeout. This feature handles just one specific event, e.g. create one
> alarm for instance ABC's XYZ operation with 5s, then the user is
> notified immediately at 5s if no XYZ.done event comes. If we want to
> check another instance, we need to create another alarm.
>
> This is used in the telco scenario, where the operator wants to know in
> real time if an operation failed.
>
> My patch (https://review.openstack.org/#/c/272028/) is for this purpose
> only, but I feel many guys mistake one for the other (sometimes even me)
> as they look similar. So my question is: Do you think this telco usage
> model of event-alarm timeout is valid? If not, we can avoid discussing
> its implementation and ignore the following.
>
>
> =========== event-alarm timeout implementation =============
> As it's for event-alarm, we need to keep it event-driven. Furthermore,
> for quick response, we need to use an event for timeout handling. A
> periodic worker can't meet the real-time requirement.
>
> A separate queue for 'alarm.timeout.end' (indicating the timeout has
> expired) leads to a tricky race condition, e.g. 'XYZ.done' comes in
> queue1, and 'alarm.timeout.end' comes in queue2, so they are handled in
> parallel:
>
> 1. In queue1, 'XYZ.done' is checked against the alarm (currently
> UNKNOWN), and the alarm will be set to ALARM in the next step.
> 2. In queue2, 'alarm.timeout.end' is checked against the same alarm
> (currently UNKNOWN), and the alarm will be set to OK (UNALARM) in the
> next step.
> 3. In queue1, the alarm transition happens: UNKNOWN => ALARM
> 4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)
>

can you clarify how this works? after the user creates an event timeout alarm definition through the API (i assume the alarm definition specifies we should see event x within y seconds).
- how does the evaluator get this alarm definition? is there an
  alarm.timeout.start message?
- what is this UNALARM state? to be honest, that isn't a real word so i
  don't know what it's supposed to represent here.

biggest problem for me is the only thing i know is there's an alarm.timeout.end event that needs to be handled by the evaluator. i don't know where it's coming from or what it's needed for.

> So this alarm has a bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells
> the user: required event came, then no required event came;
>
> If we put all events in one queue, the evaluator handles them one by one
> (low level oslo messaging should be multi-threaded) so that the second
> event would see the alarm state as not UNKNOWN, and give up its
> transition. As Gordc said, it's slow. But only a very small part of the
> event-alarms need timeout handling, as it's only for the telco usage
> model.

so the multithreaded part is what i was talking about. it's not handling them one by one. it's handling 64 (or whatever the default is) at any given time. whether it's one queue or two, you have a race to handle.

>
> One possible improvement, as JD pointed out, is to avoid so many spawned
> threads. We can just create one thread inside the evaluator, and ask
> this thread to handle all timeout requests from the evaluator. Is that
> acceptable for the event-alarm timeout solution?
>
>
> Best Rgds,
> Edwin

--
gord

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
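
For reference, the race discussed in this thread (an 'XYZ.done' handler and an 'alarm.timeout.end' handler both reading UNKNOWN before either one writes) can be closed by making the check and the write a single atomic step. Below is a minimal Python sketch of that idea. The `Alarm` class, `transition_if`, and the two handler names are hypothetical and not part of aodh; the state lives in memory here, whereas a real evaluator would more likely do the equivalent with a conditional update in the DB.

```python
import threading


class Alarm:
    """Hypothetical in-memory alarm, for illustration only."""

    def __init__(self):
        self.state = "UNKNOWN"
        self.history = ["UNKNOWN"]
        self._lock = threading.Lock()

    def transition_if(self, expected, new):
        """Move to `new` only if the state is still `expected`.

        The check and the write happen under one lock, so two handlers
        can never both observe UNKNOWN and both write.
        """
        with self._lock:
            if self.state != expected:
                return False  # another handler already decided; give up
            self.state = new
            self.history.append(new)
            return True


def on_event_done(alarm):
    # Handler for the awaited event, e.g. 'XYZ.done'.
    return alarm.transition_if("UNKNOWN", "ALARM")


def on_timeout(alarm):
    # Handler for 'alarm.timeout.end'.
    return alarm.transition_if("UNKNOWN", "OK")


def race_once():
    """Run both handlers in parallel and return the state history."""
    alarm = Alarm()
    workers = [
        threading.Thread(target=on_event_done, args=(alarm,)),
        threading.Thread(target=on_timeout, args=(alarm,)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return alarm.history


if __name__ == "__main__":
    print(race_once())
```

With this, whichever handler gets there first wins and the other sees a state that is no longer UNKNOWN and gives up, regardless of how many queues or threads deliver the events. The bogus UNKNOWN=>ALARM=>OK sequence cannot occur.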