On Thu, 25 Sep 2014 08:49:12 -0400
Sean Dague <s...@dague.net> wrote:

> 
> #1 - tried to get a lock, but someone else has it. Then we know we've
> got lock contention. .
> #2 - something is still holding a lock after some "long" amount of
> time.

+1 to both.

> #2 turned out to be a critical bit in understanding one of the worst
> recent gate impacting issues.
> 
> You can write a tool today that analyzes the logs and shows you these
> things. However, I wonder if we could actually do something creative
> in the code itself to do this already. I'm curious if the creative
> use of Timers might let us emit log messages under the conditions
> above (someone with better understanding of python internals needs to
> speak up here). Maybe it's too much overhead, but I think it's worth
> at least asking the question.

Even a simple log at the end when its finally released if it has been
held for a long time would be handy. As matching up acquire/release by
eye is not easy.

I don't think we get a log message when an acquire is attempted but
fails. That might help get a measure of lock contention?

> The same issue exists when it comes to processutils I think, warning
> that a command is still running after 10s might be really handy,
> because it turns out that issue #2 was caused by this, and it took
> quite a bit of decoding to figure that out.

Also I think that perhaps a log message when a period task takes longer
than the interval that the task is meant to be run would be a useful
warning sign that something odd is going on.

Chris


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to