Bugs item #2801629 was opened at 2009-06-05 11:50
Message generated for change (Comment added) made by johnvanschie
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=482468&aid=2801629&group_id=56967
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John van Schie (johnvanschie)
Assigned to: Peter Boncz (boncz)
Summary: Mserver sends no response, client hangs.
Initial Comment:
We are using MonetDB4, Feb2009-SP2, built from source (build log attached) with
64-bit OIDs.
The platform is Fedora Core 10, with 64 GB RAM (55 GB cached) and 99 GB free disk
space for the dbfarm.
Our application uses mclient to manipulate XML in the database. This morning, I
saw an mclient process that did not terminate and is still running after approx.
12 hours. Running strace on the mclient process shows that the process is waiting
on blocking I/O. Executing a fresh mclient with the query "1+1" results in a
stalled application that also waits on blocking I/O (see attached strace).
It seems that Mserver stopped sending data to the clients.
To debug this problem, I've generated stack traces of all threads of the
Mserver and a list of open files (lsof). The server is still running, and I
plan to keep it running until no more information is needed.
Unfortunately, the server is built with optimisation enabled, and I cannot share
the data, as it is confidential.
----------------------------------------------------------------------
>Comment By: John van Schie (johnvanschie)
Date: 2009-08-13 10:44
Message:
In the attached query log, you can see that queries 2936, 2937, and 2938
never returned an answer. In my process list, I can see an mclient
process for query 2936, while queries 2937 and 2938 are performed via JDBC.
BTW, the XML files contain no confidential data, so they could be sent to
CWI if needed.
----------------------------------------------------------------------
Comment By: John van Schie (johnvanschie)
Date: 2009-08-13 10:09
Message:
Peter,
It seems that I have another case of the same problem. This time it is on
an Ubuntu 8.04 LTS machine, x86_64, 64-bit Monet Feb2009-SP2 (including the patch
for fast inserting; see the thread in your mailbox with the subject 'Slow
import of large XML documents'). The machine has 32 GB RAM and 1.9 TB free for
the DB farm.
I suspect that this incident has the same cause as the incident for which
I first reported this bug. The symptoms are identical and the gdb backtraces
look very similar. I will attach the gdb traces of this new occurrence as
gdb.2.txt. I will also add a query log for you.
Hope this helps in finding the cause.
Cheers,
John
----------------------------------------------------------------------
Comment By: Peter Boncz (boncz)
Date: 2009-07-09 20:32
Message:
John,
I am very sorry I missed your query log attachment of June 12. Either the
SourceForge notification mail (as it is assigned to me) got lost or I missed
it. SF is a bit crappy, so feel free to send me a ping message in the
future when you respond; then you can be sure I see it quickly.
The query log is very useful and shows that one complicated read query was
busy when two concurrent add-docs entered the system. This apparently caused
a deadlock. I will now start thinking about and looking at the code, to find
where this could be.
thanks
Peter
1144: Thu Jun 04 20:39:08 CEST 2009 storeLargeXQueryResultToFile (file:
/tmp/jobGen19669.xml):
element job {attribute tool {"general-purpose/mime-tool"},attribute
project { "prj_xx"}, for $xirafNode in subsequence( for $xirafIter in (
doc("prj_xx.xml")//file[properties/stream/@xstart and
properties/stream/@xend]
)
where $xirafIter/@xid
and not($xirafIter/container/@tool[.="general-purpose/mime-tool"])and
not(some $x in (for $taskIter in
doc("50E5E1E3-F45A-ED1F-8122-963D5A04DDE8.xml")//r...@tool="general-purpose/mime-tool"]//task/@xid
return $taskIter cast as xs:integer) satisfies $x = $xirafIter/@xid)
and $xirafIter/properties/stream/@xstart
return $xirafIter, 327681, 16384)
return element { "item" } {$xirafNode/@xid,if
(exists($xirafNode/properties/point/@xpoint)) then
($xirafNode/properties/point/@xpoint,
$xirafNode/../../properties/stream/@xstart,
$xirafNode/../../properties/stream/@xend)
else
($xirafNode/properties/stream/@xstart,
$xirafNode/properties/stream/@xend)
,element properties { element property {attribute key
{"error-trace-info-1"},attribute value
{zero-or-one($xirafNode/properties/path)}}, ()}
}}
1145: Thu Jun 04 20:39:09 CEST 2009 insertXMLDoc:
pf:add-doc("/var/dbtransfer/xiraf19672.xml","-1690507529760011806-62-1244140748843.xml","prj_xx_log.xml")
1146: Thu Jun 04 20:39:09 CEST 2009 insertXMLDoc:
pf:add-doc("/var/dbtransfer/xiraf19673.xml","-9054443763826720862-61-1244140748843.xml","prj_xx_log.xml")
1147: Thu Jun 04 20:49:08 CEST 2009 executeQuery:
1+1
----------------------------------------------------------------------
Comment By: John van Schie (johnvanschie)
Date: 2009-07-09 13:32
Message:
Peter,
Is there anything else I could do to provide more information for you? For
what it's worth, we haven't encountered this problem again, so it seems to
be a really exceptional case.
Cheers,
John
----------------------------------------------------------------------
Comment By: John van Schie (johnvanschie)
Date: 2009-06-12 11:44
Message:
Peter,
I've attached a query log (querylog.log) that contains the requested
information. In fact, it is the tail of the full query log. Each query in
the log is assigned a number, and when the query returns, the duration of
the query is printed. So we know that query 1144 returned a non-zero exit
code and queries 1145, 1146 and 1147 never returned.
Although I'm not able to supply you the exact documents that are
referenced in the query log, I could explain the structure of the documents
if required.
Hope this helps in finding the cause.
Cheers,
John
----------------------------------------------------------------------
Comment By: Peter Boncz (boncz)
Date: 2009-06-09 17:08
Message:
Hi John,
Thanks for the bug report. It is hard to say what has happened. It could
be that the so-called short lock was taken by an interpreter thread and
never given back. In the end, this blocks all incoming queries.
The most useful info you attached is the gdb trace. However, it would
really help if you could send me the last (~12) queries that went into the
server. It would already be good to know whether these are read-only,
document management (add_doc/del_doc) or update queries. There appears to be
at least one update query there.
Another possible cause of deadlocks is sometimes bad error handling.
Therefore, if there have been any error messages coming out of that
MonetDB instance, that would also be great to know.
I will try to keep thinking, but if there is any additional information
that you can share, it would greatly help the chances of finding a
solution/fix.
thanks,
Peter
----------------------------------------------------------------------
Comment By: Sjoerd Mullender (sjoerd)
Date: 2009-06-08 18:04
Message:
This looks like a classic deadlock situation:
thread 23 is waiting for a lock in pflock_trycommit,
thread 21 is waiting for a lock in pflock_end,
threads 11, 9, 8, 6, 5, 4, 3, 2 are waiting for a lock in pflock_begin,
threads 19, 7 are waiting for a lock in set_lock.
What might be the case is that all but one of the threads waiting in
pflock_begin and the two threads waiting in pflock_trycommit and pflock_end
are all waiting for the same lock (PF_META_LOCK) which might be held by
that one pflock_begin thread, which itself could be waiting for another
lock (PF_SHORT_LOCK). Perhaps one of the other two threads waiting in
set_lock has PF_SHORT_LOCK and is waiting for yet another lock.
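The chain of waits described above can be checked mechanically by following a wait-for graph: thread, then the lock it waits on, then the thread holding that lock, and so on, until the chain either ends at a runnable thread or free lock, or revisits a thread (a deadlock cycle). The sketch below is illustrative only: the function name blocked_by_cycle is hypothetical, the thread numbers come from the gdb traces, but which thread holds which lock is an assumption, and OTHER_LOCK is a made-up stand-in for the unnamed "yet another lock".

```python
from typing import Dict, Optional


def blocked_by_cycle(holder: Dict[str, int], waits_on: Dict[int, str], start: int) -> bool:
    """Follow the wait-for chain from `start` (thread -> lock it waits on ->
    thread holding that lock -> ...). Return True if the chain revisits a
    thread, i.e. `start` is stuck in or behind a lock cycle (deadlock)."""
    seen = set()
    t: Optional[int] = start
    while t is not None:
        if t in seen:
            return True
        seen.add(t)
        lock = waits_on.get(t)
        if lock is None:  # thread is not blocked: the chain ends here
            return False
        t = holder.get(lock)  # None if nobody holds the lock
    return False


# Hypothetical snapshot of the scenario above; lock ownership is assumed.
holder = {"PF_META_LOCK": 9, "PF_SHORT_LOCK": 19, "OTHER_LOCK": 23}
waits_on = {
    23: "PF_META_LOCK",  # waiting in pflock_trycommit
    21: "PF_META_LOCK",  # waiting in pflock_end
    11: "PF_META_LOCK",  # one of the threads waiting in pflock_begin
    9: "PF_SHORT_LOCK",  # the pflock_begin thread assumed to hold PF_META_LOCK
    19: "OTHER_LOCK",    # set_lock thread assumed to hold PF_SHORT_LOCK
}

print(blocked_by_cycle(holder, waits_on, 11))
```

If no thread in the chain held OTHER_LOCK, the walk would stop at the free lock and report no cycle; only the assumed ownership closes the loop, which is exactly the open question in the traces.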
In any case, this seems an area where Peter has the expertise.
----------------------------------------------------------------------
_______________________________________________
Monetdb-bugs mailing list
https://lists.sourceforge.net/lists/listinfo/monetdb-bugs