About once a day my web server hangs because of an Elephant/BDB issue. I believe the BDB documentation describes a fix, which I'm working to implement.

The symptoms are as follows: the server uses 99% CPU as hundreds of threads continue to run. Each appears to be trying to access objects in the database, but these database operations are all blocking. A look at db_stat shows that there are 300+ active transactions, and a look at sb-thread:list-all-threads seems to confirm this. When I attempt to run a database query (listing objects of a certain class) from the REPL, that operation blocks as well.

Strangely, when I attempt an (ele:get-from-root :question-number) I get an error:

    There is no applicable method for the generic function
      #<STANDARD-GENERIC-FUNCTION ELEPHANT:GET-VALUE (2)>
    when called with arguments
      (:QUESTION-NUMBER NIL).
       [Condition of type SIMPLE-ERROR]

    Restarts:
     0: [RETRY] Retry SLIME REPL evaluation request.
     1: [ABORT] Return to SLIME's top level.
     2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "new-repl-thread" RUNNING {1003A1BD91}>)

    Backtrace:
      0: ((SB-PCL::FAST-METHOD NO-APPLICABLE-METHOD (T)) #<unavailable argument> #<unavailable argument> #<ST..
      1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (ELEPHANT:GET-FROM-ROOT :QUESTION-NUMBER) #<NULL-LEXENV>)
      2: (SWANK::EVAL-REGION "(ele:get-from-root

I took a look at the BDB FAQ and the following item seems relevant:

    A transactional database environment is hanging, and no threads of control are making progress.

    The most common cause of this failure is a thread of control exiting unexpectedly, while holding a Berkeley DB mutex or a read/write logical database lock. If a thread of control exits holding a data structure mutex, other threads of control will likely lock up fairly quickly, queued behind the mutex. If a thread of control exits holding a logical database lock, other threads of control may lock up over a long period of time, as they will not be blocked until they attempt to acquire the specific page for which a lock is not available. See the "Deadlock debugging" section of the Berkeley DB Reference Guide for more information on debugging deadlocks.

    Whenever a thread of control exits Berkeley DB holding a mutex or logical lock, the failure must be resolved. See the "Handling failure in Transactional Data Store applications" section of the Berkeley DB Reference Guide for more information.

    Finally, the Berkeley DB API is not re-entrant, and it is usually unsafe for signal handlers to call the Berkeley DB methods. See the "Signal handling" section of the Berkeley DB Reference Guide for more information.

---

The solution seems to be to call DB_ENV->failchk periodically to check for threads that have terminated without releasing their locks or mutexes. However, I'm not sure how this could ever occur, given the UNWIND-PROTECT clauses in the current Elephant system. What do you all make of this situation?

Resolving this issue requires several changes. FFI bindings for DB_ENV->failchk, set_thread_id, set_isalive, and set_thread_count must be implemented and set up to deal correctly with Lisp threads. This seems somewhat hairy to get working on all implementations and OSes.

Unfortunately this issue crops up fairly frequently for me. Has anyone else run into it?

-Red
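
P.S. To make the "deal with Lisp threads" part more concrete, here is roughly the bookkeeping I have in mind on the Lisp side. It is an SBCL-only sketch, not working Elephant code: the names *db-threads*, register-db-thread, db-thread-alive-p and reap-dead-db-threads are made up for illustration, and the FFI callbacks that would actually pass these ids to set_thread_id and set_isalive are not shown.

```lisp
;;; SBCL-only sketch. Every name defined here is hypothetical;
;;; none of this exists in Elephant or Berkeley DB.

(defvar *db-threads* (make-hash-table)
  "Maps a small integer id to the SB-THREAD:THREAD it was issued to.")

(defvar *next-db-thread-id* 0
  "Last integer id handed out by REGISTER-DB-THREAD.")

(defvar *db-threads-lock* (sb-thread:make-mutex :name "db thread registry")
  "Guards *DB-THREADS* and *NEXT-DB-THREAD-ID*.")

(defun register-db-thread (&optional (thread sb-thread:*current-thread*))
  "Issue THREAD a fresh integer id and remember which thread owns it."
  (sb-thread:with-mutex (*db-threads-lock*)
    (let ((id (incf *next-db-thread-id*)))
      (setf (gethash id *db-threads*) thread)
      id)))

(defun db-thread-alive-p (id)
  "Return T if ID belongs to a registered thread that is still running."
  (sb-thread:with-mutex (*db-threads-lock*)
    (let ((thread (gethash id *db-threads*)))
      (and thread (sb-thread:thread-alive-p thread) t))))

(defun reap-dead-db-threads ()
  "Forget ids whose threads have exited. Returns the list of ids dropped."
  (sb-thread:with-mutex (*db-threads-lock*)
    (let ((dead '()))
      ;; Collect first, then remove, so the table is not modified
      ;; while MAPHASH is walking it.
      (maphash (lambda (id thread)
                 (unless (sb-thread:thread-alive-p thread)
                   (push id dead)))
               *db-threads*)
      (dolist (id dead dead)
        (remhash id *db-threads*)))))
```

A thread would call (register-db-thread) before its first database operation, and an is-alive callback could then answer by calling db-thread-alive-p on the id BDB passes back. Whether small integer ids like these fit BDB's thread id type on every platform is something I have not checked.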