No answers, just some more questions...

On 3/14/14 5:44 AM, Dyre Tjeldvoll wrote:
Hi Brett,

I did look briefly at your stack traces, but I was hoping someone with more Derby "cred" would pick up the issue. Based on what you have said and the stack traces it looks to me like you don't have a traditional deadlock. It looked like all the threads you mentioned were competing for the monitor lock on the same PreparedStatement (which they get from the statement cache). One thread has the lock and it appears to be spending quite a bit of time optimizing, as you say. But in later stack traces, the thread that used to hold the lock is now trying to acquire it again - presumably because it has finished the previous prepare. So to me it looks like there are at least two issues here:

1. That the prepare takes so long
An 8 table join with a lot of indexes could give rise to a lot of combinations.
2. Why all the threads seem to think that they need to reprepare the query. One possibility is that there is "something" that keeps invalidating the query. If that "something" occurs frequently enough that the query is already invalid by the time the long-lasting prepare finishes, that could make the system appear to hang. The fact that the thread that was originally preparing the query apparently tries to acquire the lock again almost immediately after finishing the prepare points in this direction.
Right. Why is the query being re-prepared in the first place? Is there any DDL being performed on the table? Table compression? The table doesn't seem to be changing fast enough that statistics would be automatically recalculated. But maybe there's a bug in the auto-stats calculator. It might be worth turning auto-stats calculation off and seeing if the problem goes away.
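
If you want to try that experiment, here is a minimal sketch of disabling the automatic statistics daemon (assuming Derby 10.8 or later, where it is controlled by the derby.storage.indexStats.auto property; the database name below is just a placeholder):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class DisableAutoStats {
        public static void main(String[] args) throws Exception {
            // Assumption: derby.storage.indexStats.auto (Derby 10.8+) controls the
            // automatic index-statistics daemon. It is a system-wide property and
            // must be set before the engine boots (it could also go in derby.properties).
            System.setProperty("derby.storage.indexStats.auto", "false");

            // "myDB" is a placeholder database name.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:derby:myDB;create=true")) {
                // Statistics can still be refreshed explicitly when needed, e.g.
                // CALL SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('APP', 'MYTABLE', NULL)
                System.out.println("Booted with automatic index statistics disabled");
            }
        }
    }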

Is the application prone to interrupts? Maybe there are some more loose ends in interrupt handling.

Alternatively, there could be some kind of poor synchronization/race condition causing all the threads to start repreparing the query simultaneously (for the same reason). So when a thread tries to acquire the lock, it is blocked until the first thread has finished the long-lasting prepare, but when it obtains the lock it does not detect that the reprepare is no longer needed and proceeds with another (unnecessary) long-lasting prepare. And so on...
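
Roughly the pattern I have in mind, as a conceptual sketch only (the class and field names here are invented; this is not Derby's actual code, and memory-visibility details are ignored for brevity):

    // Every waiting thread decides that a reprepare is needed *before* blocking
    // on the statement's monitor, and never re-checks validity after acquiring it.
    class CachedStatementSketch {
        private boolean valid = false;
        private Object plan;

        Object getPlan() {
            if (!valid) {                      // every queued thread saw "invalid"
                synchronized (this) {
                    // Missing here: a second "if (!valid)" check. Without it, each
                    // thread that was blocked on the monitor redoes the expensive
                    // optimization as soon as it gets the lock, even though the
                    // previous holder just finished it.
                    plan = optimize();         // the long-running prepare
                    valid = true;              // may be flipped back by an invalidation
                }
            }
            return plan;
        }

        private Object optimize() {
            return new Object();               // stand-in for the plan search
        }
    }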

But this is just speculation, I have not had a chance to look at the code to verify either of these theories.

Hopefully, others can comment also.


On 03/14/2014 12:33 PM, Bergquist, Brett wrote:
Sorry to be persistent, but this is a very serious issue that I am
seeing in production, and I would just like some guidance on where to
look in the source for a possible issue.

My assumption is that there is a possibility that the optimization may
get into a state where it may take a very long time to complete or may
even be looping.   I say this because of the stack traces that I have,
the fact that at least one time a long executing instance did complete,
and the fact that this same query is executed thousands of times a day
without issue.

The query does have multiple tables to join (about 8 if I remember
correctly but can get the details), so there could be many query plans,
but there are proper indexes on each table.  Also, the statistics are
up to date, and the same query was being performed over and over during
the same period while the 5 instances were stuck.  The
data being queried upon is not changing that rapidly, mostly increasing
in row count, but at a rate of hundreds of rows per day with very few
rows being removed.
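
Just to put a rough number on the search space (this is plain
arithmetic, not a claim about how Derby's optimizer actually
enumerates or prunes candidate plans):

    // Back-of-the-envelope: number of join orders for an N-way join, before
    // even considering the choice of index/access path per table.
    public class JoinOrders {
        public static void main(String[] args) {
            int tables = 8;                    // roughly the size of the query here
            long orders = 1;
            for (int n = 2; n <= tables; n++) {
                orders *= n;                   // n! join orders
            }
            // Prints "8 tables -> 40320 join orders"; if each table also has,
            // say, three usable access paths, the decorated permutations
            // multiply on top of that.
            System.out.println(tables + " tables -> " + orders + " join orders");
        }
    }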

So what I would like to explore is under what conditions the optimizer
could get into a state where it might loop or make no progress for a
very long period of time.

Thanks for any insight that you might provide.

On Mar 13, 2014, at 3:32 PM, Bergquist, Brett <[email protected]
<mailto:[email protected]>> wrote:

See https://issues.apache.org/jira/browse/DERBY-6510
Basically it looks like for some reason the system gets stuck in
trying to create the query plan.   Copied from the JIRA:
We had an issue today in a production environment at a large customer
site. Basically 5 database interactions became stuck and are not
progressing. Part of the system dump performs a stack trace every few
seconds for a period of a minute on the Glassfish application server
and the Derby database engine (running in network server mode). Also,
the dump captures the current transactions and the current lock table
(i.e. SYSCS_DIAG.TRANSACTION_TABLE and SYSCS_DIAG.LOCK_TABLE). We had to
restart the system and in doing so, the Derby database engine would
not shutdown and had to be killed.
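
For reference, that part of the capture is essentially the equivalent
of querying the diagnostic tables like this (a minimal JDBC sketch; the
connection URL and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DumpDiag {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL/credentials for a Derby network server.
            String url = "jdbc:derby://localhost:1527/myDB";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                for (String table : new String[] {
                        "SYSCS_DIAG.TRANSACTION_TABLE", "SYSCS_DIAG.LOCK_TABLE" }) {
                    System.out.println("=== " + table + " ===");
                    try (ResultSet rs = stmt.executeQuery("SELECT * FROM " + table)) {
                        ResultSetMetaData md = rs.getMetaData();
                        while (rs.next()) {
                            StringBuilder row = new StringBuilder();
                            for (int i = 1; i <= md.getColumnCount(); i++) {
                                row.append(md.getColumnName(i)).append('=')
                                   .append(rs.getString(i)).append("  ");
                            }
                            System.out.println(row);
                        }
                    }
                }
            }
        }
    }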

The stack traces of the Derby engine show 5 threads that are basically
making no progress: at each sample they are at the same point, waiting.

I will attach the stack traces as well as the state of the
transactions and locks.

Interestingly, "derby.jdbc.xaTransactionTimeout=1800" is set, yet the
transactions did not time out. The timeout is 30 minutes, but the
transactions were in process for hours.
It looks like there is an issue where the optimizer is getting stuck.
If you look at DRDAConnThread_42, it looks like it is trying to create
a plan where it is calling getNextDecoratedPermutation.
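
For anyone not familiar with that area, my understanding of the general
shape of that search loop is roughly the following; this is only a
conceptual sketch with invented names, not Derby's actual OptimizerImpl
code:

    // Conceptual sketch of a join-order/access-path search loop, to show
    // where a thread could spend a very long time.
    class OptimizerSketch {

        interface Plan {
            double cost();
        }

        interface Enumerator {
            boolean nextPermutation();           // next join order
            boolean nextDecoratedPermutation();  // next access-path combination
            Plan costCurrentPlan();
        }

        Plan chooseJoinPlan(Enumerator e) {
            Plan bestPlan = null;
            // Outer loop: each permutation of the tables (join order).
            while (e.nextPermutation()) {
                // Inner loop: each combination of access paths for that order,
                // i.e. the "decorated" permutation. With many tables and many
                // indexes the number of iterations explodes.
                while (e.nextDecoratedPermutation()) {
                    Plan candidate = e.costCurrentPlan();
                    if (bestPlan == null || candidate.cost() < bestPlan.cost()) {
                        bestPlan = candidate;
                    }
                    // A real optimizer would also check some kind of search
                    // timeout here; a stack repeatedly caught inside this inner
                    // loop suggests either a huge search space or a pruning/
                    // timeout check that is not kicking in.
                }
            }
            return bestPlan;
        }
    }
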
Note that all 5 of the transactions were doing the exact same query to
the database. They were started at different times however, many
minutes, if not hours apart (i.e. 7:59:25, 8:14:45, 8:40:00, 9:09:11,
9:18:20) all today.
Note that many hundreds of times during this period, the same exact
queries were done successfully with no issue.   This same query is
processed many thousands of times per day.
Another point, which cannot be seen from the attached files, is that
there was another instance of this that did complete, but it took
2 hours and 46 seconds.   So it is possible that these might have
completed given enough time.   We may have seen this before, where the
customer has complained of processing that took hours, but until today
I never had a good capture of the state.
I am asking for help on where to look in the source code so that I
might be able to determine a possible fix, and any help or ideas will
be greatly appreciated.



