Bugs item #2801629, was opened at 2009-06-05 11:50
Message generated for change (Comment added) made by johnvanschie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2801629&group_id=56967

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John van Schie (johnvanschie)
Assigned to: Peter Boncz (boncz)
Summary: Mserver sends no response, client hangs.

Initial Comment:
We are using MonetDB4, Feb2009-SP2, built from source (build log attached) with 64-bit OIDs. The platform is Fedora Core 10, with 64 GB RAM (55 GB cached) and 99 GB free disk space for the dbfarm.

Our application uses mclient to manipulate XML in the database. This morning, I saw an mclient process that had not terminated and was still running after approx. 12 hours. Running strace on the mclient process shows that it is waiting on blocking I/O. Executing a fresh mclient with the query "1+1" also stalls, again waiting on blocking I/O (see attached strace). It seems that Mserver has stopped sending data to the clients.

To debug this problem, I've generated stack traces of all threads of the Mserver and a list of its open files (lsof).

The server is still running, and I plan to keep it running until no more information is needed. Unfortunately, the server is built with optimisation enabled, and I cannot share the data, as it is confidential.

----------------------------------------------------------------------

>Comment By: John van Schie (johnvanschie)
Date: 2009-07-09 13:32

Message:
Peter,

Is there anything else I could do to provide more information for you?
For what it's worth, we haven't encountered this problem any more, so it seems to be a really exceptional case.

Cheers,
John

----------------------------------------------------------------------

Comment By: John van Schie (johnvanschie)
Date: 2009-06-12 11:44

Message:
Peter,

I've attached a query log (querylog.log) that contains the requested information. In fact, it is the tail of the full query log.
Each query in the log is assigned a number, and when the query returns, its duration is printed. So we know that query 1144 returned a non-zero exit code and queries 1145, 1146 and 1147 never returned.

Although I'm not able to supply the exact documents that are referenced in the query log, I could explain the structure of the documents if required.

Hope this helps in finding the cause.

Cheers,
John

----------------------------------------------------------------------

Comment By: Peter Boncz (boncz)
Date: 2009-06-09 17:08

Message:
Hi John,

Thanks for the bug report. It is hard to say what has happened; it could be that the so-called short lock, which was apparently taken by an interpreter thread, was never given back. In the end this blocks all incoming queries.

The most useful info you attached is the gdb trace. However, it would really help if you could send me the last (~12) queries that went into the server. It would already be good to know whether these are read-only, document management (add_doc/del_doc) or update queries. There appears to be at least one update query there.

Another possible cause of deadlocks is sometimes bad error handling. Therefore, if there have been any error messages coming out of that MonetDB instance, that would also be great to know.

I will keep thinking, but any additional information that you can share would greatly improve the chances of finding a solution/fix.

thanks,
Peter

----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2009-06-08 18:04

Message:
This looks like a classic deadlock situation:
thread 23 is waiting for a lock in pflock_trycommit,
thread 21 is waiting for a lock in pflock_end,
threads 11, 9, 8, 6, 5, 4, 3, 2 are waiting for a lock in pflock_begin,
threads 19, 7 are waiting for a lock in set_lock.
What might be the case is that all but one of the threads waiting in pflock_begin and the two threads waiting in pflock_trycommit and pflock_end are all waiting for the same lock (PF_META_LOCK), which might be held by that one pflock_begin thread, which itself could be waiting for another lock (PF_SHORT_LOCK). Perhaps one of the other two threads waiting in set_lock has PF_SHORT_LOCK and is waiting for yet another lock.

In any case, this seems an area where Peter has the expertise.

----------------------------------------------------------------------

_______________________________________________
Monetdb-bugs mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/monetdb-bugs
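
Editorial sketch (not part of the original thread): the two explanations offered in the comments, a lock cycle and a short lock that is taken but never released, can be told apart with a small wait-for graph. The Python below is only an illustration under stated assumptions. The lock names PF_META_LOCK and PF_SHORT_LOCK come from the thread, but the thread labels "A", "B" and "gone", the choice of which lock the PF_SHORT_LOCK holder waits for, and the helper find_wait_cycle are all hypothetical and were not read out of the attached gdb trace.

```python
from typing import Dict, List, Optional


def find_wait_cycle(holds: Dict[str, str],
                    waits: Dict[str, str]) -> Optional[List[str]]:
    """Return the threads that form a wait-for cycle, or None.

    holds maps a lock name to the thread currently holding it.
    waits maps a thread to the lock it is blocked on.
    """
    for start in waits:
        path: List[str] = []
        seen = set()
        thread: Optional[str] = start
        # Follow "thread -> lock it waits for -> holder of that lock".
        while thread is not None and thread not in seen:
            seen.add(thread)
            path.append(thread)
            lock = waits.get(thread)
            thread = holds.get(lock) if lock is not None else None
        if thread is not None:
            # We came back to a thread already on the path: a cycle.
            return path[path.index(thread):]
    return None


# Hypothetical scenario 1: a lock cycle. "A" holds PF_META_LOCK and
# waits for PF_SHORT_LOCK; "B" holds PF_SHORT_LOCK and (assumption)
# waits for PF_META_LOCK.
cycle = find_wait_cycle(
    holds={"PF_META_LOCK": "A", "PF_SHORT_LOCK": "B"},
    waits={"A": "PF_SHORT_LOCK", "B": "PF_META_LOCK"},
)

# Hypothetical scenario 2: PF_SHORT_LOCK was taken by a thread ("gone")
# that never released it and is not itself waiting for anything.
leaked = find_wait_cycle(
    holds={"PF_META_LOCK": "A", "PF_SHORT_LOCK": "gone"},
    waits={"A": "PF_SHORT_LOCK"},
)

print(cycle)   # threads in the cycle for scenario 1
print(leaked)  # None: everything is blocked, yet there is no cycle
```

In the second scenario every waiter is still stuck even though no cycle exists, which is why knowing which thread actually holds PF_SHORT_LOCK matters more than the list of waiters alone.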
