On 21/09/16 01:43 AM, Zhai, Edwin wrote:
> I'd like to clarify the event-alarm timeout design, as many of you
> seem to have misunderstood it. Please correct me if I've got anything
> wrong. I realized that there are two different things that we
> sometimes mix up:
> 1. event-timeout-alarm
> This is a new type of alarm that brackets *.start and *.end events and
> fires when a *.start is received but no matching *.end arrives within
> the timeout. This new alarm handles one type of event/action, e.g.
> create one alarm for instance creation, and all instances created in
> the future will be handled by this alarm. It is not meant to be real
> time, so it's acceptable for the user to learn about an instance
> creation failure within 5 minutes.
> This new type of alarm can be implemented by one worker that checks the
> DB periodically and does the statistical work. That is, the new
> evaluator works in 'polling' mode, much like the threshold alarm
> evaluator.
> One BP is @
we should probably disregard this bp since it was assumed you guys
talked over it. i'm abandoning it as i think we just forgot about it.
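for reference, the 'polling' check described in item 1 could look roughly like the sketch below. this is only an illustration under stated assumptions: the Event class, its fields and find_timed_out() are hypothetical names, not aodh or ceilometer APIs, and a real evaluator would query the event store instead of scanning a list. it also treats any *.end as success, even a late one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions."""
    event_type: str    # e.g. "compute.instance.create.start"
    resource_id: str   # e.g. the instance id
    timestamp: float   # seconds since some epoch


def find_timed_out(events, prefix, timeout, now):
    """Return ids that saw <prefix>.start but no <prefix>.end,
    where the start is older than `timeout` seconds at time `now`."""
    started = {}   # resource_id -> timestamp of the .start event
    ended = set()  # resource_ids that have a .end event
    for ev in events:
        if ev.event_type == prefix + ".start":
            started[ev.resource_id] = ev.timestamp
        elif ev.event_type == prefix + ".end":
            ended.add(ev.resource_id)
    return sorted(
        rid for rid, ts in started.items()
        if rid not in ended and now - ts > timeout
    )
```

a periodic worker would call find_timed_out() once per polling interval and raise the alarm for every id it returns, which is why the delay before the user hears about a failure is bounded by the interval rather than being real time.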
> 2. event-alarm timeout
> This is a new feature for the _existing_ event-alarm evaluator. An alarm
> becomes 'UNALARM' when the desired event is not received within the
> timeout. This feature handles just one specific event, e.g. create one
> alarm for instance ABC's XYZ operation with a 5s timeout, and the user
> is notified as soon as the 5s expire if no XYZ.done event arrives. To
> check another instance, we need to create another alarm.
> This is used in the telco scenario, where the operator wants to know
> about an operation failure in real time.
> My patch (https://review.openstack.org/#/c/272028/) is for this purpose
> only, but I feel many people confuse the two (sometimes even me) as they
> look similar. So my question is: do you think this telco usage model of
> event-alarm timeout is valid? If not, we can avoid discussing its
> implementation and ignore the following.
> =========== event-alarm timeout implementation =============
> As it's for event-alarm, we need to keep it event-driven. Furthermore,
> for quick response, we need to use an event for timeout handling. A
> periodic worker can't meet the real-time requirement.
> A separate queue for 'alarm.timeout.end' (which indicates that the
> timeout expired) leads to a tricky race condition. E.g. 'XYZ.done'
> arrives in queue1 and 'alarm.timeout.end' arrives in queue2, so they
> are handled in parallel:
> 1. In queue1, 'XYZ.done' is checked against the alarm (currently
> UNKNOWN), which will be set to ALARM in the next step.
> 2. In queue2, 'alarm.timeout.end' is checked against the same alarm
> (currently UNKNOWN), which will be set to OK (UNALARM) in the next step.
> 3. In queue1, the alarm transition happens: UNKNOWN => ALARM
> 4. In queue2, another alarm transition happens: ALARM => OK (UNALARM)
can you clarify how this works? after the user creates an event timeout
alarm definition through the API (i assume the alarm definition
specifies we should see event x within y seconds):
- how does the evaluator get this alarm definition? is there an
- what is this UNALARM state? to be honest, that isn't a real word so i
don't know what it's supposed to represent here.
the biggest problem for me is that the only thing i know is there's an
alarm.timeout.end event that needs to be handled by the evaluator. i
don't know where it's coming from or what it's needed for.
> So this alarm has a bogus transition: UNKNOWN => ALARM => UNALARM, which
> tells the user that the required event came, and then that no required
> event came.
> If we put all events in one queue, the evaluator handles them one by one
> (the low-level oslo.messaging layer should be multi-threaded), so the
> second event would see that the alarm state is no longer UNKNOWN and
> give up its transition. As Gordc said, it's slow. But only a very small
> part of the event alarms need timeout handling, as it's only for the
> telco usage model.
so the multithreaded part is what i was talking about. it's not handling
them one by one. it's handling 64 (or whatever the default is) at any
given time. whether it's one queue or two, you have a race to handle.
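one way to handle that race, regardless of how many queues there are, is to make the check and the state change a single atomic step, so a handler only moves the alarm out of UNKNOWN if it is still UNKNOWN. a minimal sketch, assuming a single evaluator process: the Alarm class and the on_event()/on_timeout() helpers are hypothetical, not the aodh evaluator API, and with several evaluator processes the same idea would have to be a conditional update in the storage layer rather than an in-process lock.

```python
import threading


class Alarm:
    """Hypothetical in-memory alarm with an atomic compare-and-set."""

    def __init__(self):
        self.state = "UNKNOWN"
        self._lock = threading.Lock()

    def transition(self, expected, new):
        """Set state to `new` only if it is currently `expected`.

        The check and the update happen under one lock, so two
        handlers cannot both observe UNKNOWN and both transition.
        """
        with self._lock:
            if self.state != expected:
                return False  # someone else got there first; give up
            self.state = new
            return True


def on_event(alarm):
    # 'XYZ.done' arrived: UNKNOWN => ALARM, as in the thread
    return alarm.transition("UNKNOWN", "ALARM")


def on_timeout(alarm):
    # 'alarm.timeout.end' arrived: UNKNOWN => OK (UNALARM)
    return alarm.transition("UNKNOWN", "OK")
```

whichever handler loses gets False back and drops its transition, so the bogus UNKNOWN => ALARM => OK sequence cannot happen even with 64 handlers running at once.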
> One possible improvement, as JD pointed out, is to avoid spawning so
> many threads. We can create just one thread inside the evaluator and
> have it handle all timeout requests from the evaluator. Is that
> acceptable for the event-alarm timeout solution?
> Best Rgds,