Bugs item #2801629, was opened at 2009-06-05 11:50
Message generated for change (Comment added) made by johnvanschie
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2801629&group_id=56967

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John van Schie (johnvanschie)
Assigned to: Peter Boncz (boncz)
Summary: Mserver sends no response, client hangs.

Initial Comment:
We are using MonetDB4, Feb2009-SP2, built from source (build log attached) with 
64-bit OIDs.
The platform is Fedora core 10, with 64 GB RAM (55GB cached) and 99GB free disk 
space for the dbfarm.

Our application uses mclient to manipulate XML in the database. This morning, I 
saw an mclient process that had not terminated and was still running after approx. 
12 hours. Using strace on the mclient process shows that the process is waiting 
on blocking I/O. Executing a fresh mclient with the query "1+1" also results in a 
stalled process that waits on blocking I/O (see attached strace). 
It seems that Mserver stopped sending data to the clients.
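
A hang like this can be detected automatically by running the client under a
timeout. The following is only a minimal sketch: the function name
client_hangs is made up for illustration, and the command passed in is a
placeholder, since the exact mclient invocation is not given in this report.

```python
import subprocess
import sys


def client_hangs(command, timeout_seconds):
    """Return True if `command` is still running after `timeout_seconds`."""
    try:
        subprocess.run(
            command,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising on timeout.
        return True
    return False


if __name__ == "__main__":
    # Stand-in for a stalled client: a child process that just sleeps.
    stalled = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(client_hangs(stalled, timeout_seconds=0.5))
```

In practice the placeholder command would be replaced by the real client call
with a trivial query such as "1+1".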

To debug this problem, I've generated stack traces of all threads for the 
Mserver and a list of open files (lsof). The server is still running, and I 
plan to keep it running until no more information is needed.
Unfortunately, the server is built with optimisation enabled, and I cannot share 
the data, as it is confidential.

----------------------------------------------------------------------

>Comment By: John van Schie (johnvanschie)
Date: 2009-07-09 13:32

Message:
Peter,

Is there anything else I could do to provide more information for you? For
what it's worth, we haven't encountered this problem again, so it seems to be
a really exceptional case.

Cheers,

John

----------------------------------------------------------------------

Comment By: John van Schie (johnvanschie)
Date: 2009-06-12 11:44

Message:
Peter,

I've attached a query log (querylog.log) that contains the requested
information. In fact, it is the tail of the full query log. Each query in
the log is assigned a number and when the query returns, the duration of
the query is printed. So we know that query 1144 returned a non-zero exit
code and queries 1145, 1146 and 1147 never returned.
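
For reference, the way such a log separates failed queries from queries that
never returned can be sketched as below. The line layout (START/END records),
the durations and the exit code value are assumptions made for illustration;
the actual format of querylog.log is not reproduced in this thread.

```python
import re

# Hypothetical record formats; the real querylog.log layout may differ.
START = re.compile(r"^START (\d+)$")
END = re.compile(r"^END (\d+) duration=([\d.]+)s exit=(\d+)$")


def summarize(lines):
    """Return (failed, pending): query numbers that exited non-zero,
    and query numbers that were started but never logged a return."""
    started = []
    exit_codes = {}
    for line in lines:
        line = line.strip()
        match = START.match(line)
        if match:
            started.append(int(match.group(1)))
            continue
        match = END.match(line)
        if match:
            exit_codes[int(match.group(1))] = int(match.group(3))
    failed = [q for q in started if exit_codes.get(q, 0) != 0]
    pending = [q for q in started if q not in exit_codes]
    return failed, pending


# Made-up sample mirroring the situation described above.
SAMPLE_LOG = """\
START 1143
END 1143 duration=0.8s exit=0
START 1144
END 1144 duration=2.1s exit=1
START 1145
START 1146
START 1147
"""

if __name__ == "__main__":
    print(summarize(SAMPLE_LOG.splitlines()))
```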

Although I'm not able to supply you the exact documents that are
referenced in the query log, I could explain the structure of the documents
if required.

Hope this helps in finding the cause.

Cheers,

John

----------------------------------------------------------------------

Comment By: Peter Boncz (boncz)
Date: 2009-06-09 17:08

Message:
Hi John,

Thanks for the bug report. It is hard to say what has happened. It could
be that the so-called short lock was taken by an interpreter thread and
never given back. In the end, this blocks all incoming queries.

The most useful info you attached is the gdb trace. However, it would
really help if you could send me the last (~12) queries that went into the
server. It would already be good to know whether these are read-only,
document management (add_doc/del_doc) or update queries. There appears to be
at least one update query there.

Another possible cause of deadlocks is bad error handling.
Therefore, if there have been any error messages coming out of that
MonetDB instance, that would also be great to know.
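
To illustrate what is meant by bad error handling leaving a lock taken, here
is a minimal sketch. It is not MonetDB code: the function names are made up,
and the lock is a plain Python lock standing in for the short lock.

```python
import threading


def leaky_update(lock, fail):
    """Takes the lock, but the error path returns without releasing it."""
    lock.acquire()
    if fail:
        # The exception leaves this function with the lock still held.
        raise RuntimeError("query failed")
    lock.release()


def safe_update(lock, fail):
    """Same work, but the lock is released on every exit path."""
    with lock:
        if fail:
            raise RuntimeError("query failed")


if __name__ == "__main__":
    short_lock = threading.Lock()  # hypothetical stand-in for the short lock
    for update in (safe_update, leaky_update):
        try:
            update(short_lock, fail=True)
        except RuntimeError:
            pass
        print(update.__name__, "left lock held:", short_lock.locked())
```

After the leaky variant has failed once, every later thread that asks for the
same lock waits forever, which would match a server that stops answering all
clients.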

I will try to keep thinking, but if there is any additional information
that you can share, it would greatly help the chances of finding a
solution/fix.

thanks,

Peter


----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2009-06-08 18:04

Message:
This looks like a classic deadlock situation:
thread 23 is waiting for a lock in pflock_trycommit,
thread 21 is waiting for a lock in pflock_end,
threads 11, 9, 8, 6, 5, 4, 3, 2 are waiting for a lock in pflock_begin,
threads 19, 7 are waiting for a lock in set_lock.

What might be the case is that all but one of the threads waiting in
pflock_begin and the two threads waiting in pflock_trycommit and pflock_end
are all waiting for the same lock (PF_META_LOCK) which might be held by
that one pflock_begin thread, which itself could be waiting for another
lock (PF_SHORT_LOCK).  Perhaps one of the other two threads waiting in
set_lock has PF_SHORT_LOCK and is waiting for yet another lock.
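
The reasoning above amounts to following a wait-for chain: from a blocked
thread to the lock it wants, to the thread holding that lock, and so on. A
rough sketch follows. The trace does not establish which thread holds which
lock, so the choice of threads 11 and 19 as holders, and the name
UNIDENTIFIED_LOCK, are assumptions made purely for illustration.

```python
def blocking_chain(thread, waits_for, held_by):
    """Follow thread -> wanted lock -> holder of that lock, repeatedly.

    Returns (chain, status), where status is "cycle" for a true deadlock,
    "unknown holder" if nobody is known to hold the wanted lock, or
    "running" if the last thread in the chain is not waiting at all.
    """
    chain = [thread]
    seen = {thread}
    while True:
        lock = waits_for.get(chain[-1])
        if lock is None:
            return chain, "running"
        owner = held_by.get(lock)
        if owner is None:
            return chain, "unknown holder"
        chain.append(owner)
        if owner in seen:
            return chain, "cycle"
        seen.add(owner)


# Hypothetical assignment consistent with the guess above, not with known facts.
WAITS_FOR = {
    23: "PF_META_LOCK",       # waiting in pflock_trycommit
    21: "PF_META_LOCK",       # waiting in pflock_end
    9: "PF_META_LOCK",        # waiting in pflock_begin
    11: "PF_SHORT_LOCK",      # the one pflock_begin thread assumed to hold META
    19: "UNIDENTIFIED_LOCK",  # waiting in set_lock, assumed to hold SHORT
}
HELD_BY = {"PF_META_LOCK": 11, "PF_SHORT_LOCK": 19}

if __name__ == "__main__":
    print(blocking_chain(23, WAITS_FOR, HELD_BY))
```

With this assignment the chain from thread 23 ends at thread 19 with an
unknown holder, which is exactly the open question: who holds the lock that
the set_lock thread is waiting for.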

In any case, this seems an area where Peter has the expertise.

----------------------------------------------------------------------
