Dimitrios Apostolou wrote:
Hello,
in my program using LMDB, I've experienced rare deadlocks in highly
concurrent mixed (read/write/cursor iteration) workloads. The end result
is that hundreds of threads are hanging waiting on LOCK_MUTEX_W().
Unfortunately I'm not quite sure why this happens.
If my understanding is correct, this mutex is locked from the beginning of
the transaction, until the commit/abort, effectively serialising writers.
So I assume that a writer somehow dies or is killed abruptly before it
can run its atexit() cleanup, and this shared mutex then remains locked
forever.
What would you suggest for such a situation? I'm thinking of patching LMDB
to lock with pthread_mutex_timedlock() and periodically check whether the
PID that took the mutex is still alive. Is the writer PID stored somewhere,
or would a change of format be needed? Any other ideas are welcome!
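
For illustration only, the timed-wait idea might look roughly like the
sketch below. It uses the POSIX calls pthread_mutex_timedlock() and
kill(pid, 0). The helper names and the owner_pid field are hypothetical:
this is not LMDB code, and it assumes the lock owner records its PID
somewhere the waiters can read it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Returns 1 if a process with this PID currently exists, 0 otherwise. */
static int pid_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;  /* exists, but we may not signal it */
}

/* Try to take the mutex, giving up after `secs` seconds.
 * Returns 0 on success, ETIMEDOUT on timeout, or another error code. */
static int lock_with_timeout(pthread_mutex_t *m, time_t secs)
{
    struct timespec deadline;

    /* pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME deadline. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;
    deadline.tv_sec += secs;
    return pthread_mutex_timedlock(m, &deadline);
}

/* Wait for the mutex, checking once per second whether the recorded owner
 * (hypothetical field, 0 meaning "unknown") is still alive.
 * Returns 0 once locked, ESRCH if the owner appears to be dead. */
static int lock_or_detect_dead_owner(pthread_mutex_t *m, const pid_t *owner_pid)
{
    for (;;) {
        int rc = lock_with_timeout(m, 1);
        if (rc != ETIMEDOUT)
            return rc;
        if (*owner_pid != 0 && !pid_alive(*owner_pid))
            return ESRCH;
    }
}
```

Note that this only detects the dead owner. A waiter cannot portably
unlock a mutex held by another process, and PIDs can be reused, so the
liveness check is a heuristic at best.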
We have a patch to use robust mutexes. They're a few percent slower but will
allow recovery from this situation.
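
As a rough sketch of how robust mutexes behave in general (this is not
the patch itself, and the helper names are made up for the example): when
the owner dies while holding the lock, the next locker gets EOWNERDEAD
instead of blocking forever, and can mark the mutex usable again with
pthread_mutex_consistent(). The demo uses a thread that exits while
holding the lock. For use across processes the mutex would also need
PTHREAD_PROCESS_SHARED and would have to live in shared memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Initialise a mutex with the robust attribute set. */
static int robust_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock the mutex. If the previous owner died holding it, recover and
 * set *recovered to 1. Returns 0 when the caller holds the lock. */
static int robust_mutex_lock(pthread_mutex_t *m, int *recovered)
{
    int rc = pthread_mutex_lock(m);

    *recovered = 0;
    if (rc == EOWNERDEAD) {
        /* The protected state may be half-updated: repair it here,
         * then declare the mutex consistent again. */
        rc = pthread_mutex_consistent(m);
        if (rc == 0)
            *recovered = 1;
    }
    return rc;
}

/* Simulated writer that terminates without unlocking. */
static void *dying_writer(void *arg)
{
    pthread_mutex_lock((pthread_mutex_t *)arg);
    return NULL;
}

/* Returns 1 if the owner's death was detected and recovered from,
 * 0 if the lock was taken normally, -1 on error. */
static int demo_owner_death(void)
{
    pthread_mutex_t m;
    pthread_t t;
    int recovered = 0;

    if (robust_mutex_init(&m) != 0)
        return -1;
    if (pthread_create(&t, NULL, dying_writer, &m) != 0)
        return -1;
    pthread_join(t, NULL);

    if (robust_mutex_lock(&m, &recovered) != 0)
        return -1;
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return recovered;
}
```

The extra bookkeeping needed to track a dead owner is one plausible
source of the slowdown mentioned above.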
But aside from that, either your software has a bug, or someone is messing
with your system, and you need to find out what's going on and stop that.
Thanks in advance,
Dimitris
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/