Thanks for taking the time to look at this, Dyre. I agree that this is not a deadlock, as one of the requests earlier in the day did complete; it just took a bit under 3 hours to do so ;) At the same time, while these 5 threads were in this state, other instances of the same request on other connections were completing in a matter of seconds. From this I am 99.999% sure it is not a general problem with preparing the query.
I do believe that "something" has triggered something like what you suggest in #2. I can build and debug Derby and will start looking through the code, but initially I am hoping to be pointed at the area of the code where the optimized plan is created, so that I know where to start examining.

Another question, although I believe I know the answer: since I have "derby.jdbc.xaTransactionTimeout=1800" set, I thought this might kick in and cancel the request, but on reflection it probably only kicks in "between" XA transaction operations, to kill a long-running and possibly forgotten XA transaction. There was an issue in Derby that I debugged and provided patches for (now part of the release) where XA transactions could be left open forever, causing the Derby transaction log to grow without bound (tens of thousands of log files were seen in production). The timeout was enabled just as a safeguard against that problem.

Again, thanks for taking a look and confirming what I think is happening. (I have added a couple of sketches at the very bottom of this message, below the quoted thread: one illustrating the repreparation pile-up you describe in #2, and one showing roughly what our diagnostic capture collects.)

Brett

-----Original Message-----
From: Dyre Tjeldvoll [mailto:[email protected]]
Sent: Friday, March 14, 2014 8:45 AM
To: [email protected]
Subject: Re: I need help diagnosing a Derby engine problem where threads are not progressing

Hi Brett,

I did look briefly at your stack traces, but I was hoping someone with more Derby "cred" would pick up the issue. Based on what you have said and the stack traces, it looks to me like you don't have a traditional deadlock. It looked like all the threads you mentioned were competing for the monitor lock on the same PreparedStatement (which they get from the statement cache). One thread has the lock, and it appears to be spending quite a bit of time optimizing, as you say. But in later stack traces, the thread that used to hold the lock is now trying to acquire it again - presumably because it has finished the previous prepare.

So to me it looks like there are at least two issues here:

1. That the prepare takes so long.
2. Why all the threads seem to think that they need to reprepare the query.

I'm thinking that either there is "something" that keeps invalidating the query. If that "something" occurs frequently enough that the query is already invalid by the time the long-lasting prepare finishes, that could make the system appear to hang. The fact that the thread that was originally preparing the query apparently tries to acquire the lock again almost immediately after finishing the prepare points in this direction.

Alternatively, there could be some kind of poor synchronization/race condition causing all the threads to start repreparing the query simultaneously (for the same reason). So when a thread tries to acquire the lock, it is blocked until the first thread has finished the long-lasting prepare, but when it obtains the lock it does not detect that the reprepare is no longer needed and proceeds with another (unnecessary) long-lasting prepare. And so on...

But this is just speculation; I have not had a chance to look at the code to verify either of these theories. Hopefully, others can comment also.

On 03/14/2014 12:33 PM, Bergquist, Brett wrote:
> Sorry to be persistent, but this is a very serious issue that I am seeing in production, and I would just like some guidance on where to look in the source for a possible issue.
>
> My assumption is that the optimization may get into a state where it takes a very long time to complete, or may even be looping.
> I say this because of the stack traces that I have, the fact that at least one time a long-running instance did complete, and the fact that this same query is executed thousands of times a day without issue.
>
> The query does have multiple tables to join (about 8 if I remember correctly, but I can get the details), so there could be many query plans, but there are proper indexes on each table. Also, the statistics are up to date, and the same query was being performed over and over during the same period while the 5 instances were stuck. The data being queried is not changing that rapidly; it is mostly increasing in row count, at a rate of hundreds of rows per day, with very few rows being removed.
>
> So what I would like to explore is under what conditions the optimizer could get into a state where it might loop or make no progress for a very long period of time.
>
> Thanks for any insight that you might provide.
>
> On Mar 13, 2014, at 3:32 PM, Bergquist, Brett <[email protected]> wrote:
>
>> See https://issues.apache.org/jira/browse/DERBY-6510
>> Basically it looks like, for some reason, the system gets stuck trying to create the query plan. Copied from the JIRA:
>>
>> We had an issue today in a production environment at a large customer site. Basically 5 database interactions became stuck and are not progressing. Part of the system dump performs a stack trace every few seconds for a period of a minute on the Glassfish application server and the Derby database engine (running in network server mode). Also, the dump captures the current transactions and the current lock table (i.e. syscs_diag.transactions and syscs_diag.lock_table). We had to restart the system, and in doing so the Derby database engine would not shut down and had to be killed.
>>
>> The stack traces of the Derby engine show 5 threads that are basically making no progress, in that at each sample they are at the same point, waiting.
>>
>> I will attach the stack traces as well as the state of the transactions and locks.
>>
>> Interesting is that "derby.jdbc.xaTransactionTimeout=1800" is set, yet the transactions did not time out. The timeout is for 30 minutes, but the transactions were in process for hours.
>> It looks like there is an issue where the optimizer is getting stuck. If you look at DRDAConnThread_42, it looks like it is trying to create a plan where it is calling getNextDecoratedPermutation.
>> Note that all 5 of the transactions were doing the exact same query to the database. They were started at different times, however, many minutes if not hours apart (i.e. 7:59:25, 8:14:45, 8:40:00, 9:09:11, 9:18:20), all today.
>> Note that many hundreds of times during this period, the same exact queries were done successfully with no issue. This same query is processed many thousands of times per day.
>> Another point which cannot be seen in the attached files is that there was another instance of this which did complete, but it took 2 hours and 46 minutes. So it is possible that these might have completed given enough time. We may have seen this before, where the customer has complained of processing that has taken hours, but until today I never had a good capture of the state.
>> I am asking for help on where to look in the source code so that I might be able to determine a possible fix; any help or ideas will be greatly appreciated.

--
Regards,
Dyre
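
P.S. To make the second theory above concrete for anyone else following along, here is a minimal sketch of the synchronization pattern I will be looking for in the code. This is plain Java written purely for illustration, not Derby source; the class and method names (CachedStatement, prepareFromScratch) are made up. The point is the re-check of the validity flag after the monitor is acquired: if that re-check were missing, or if something kept invalidating the plan before the waiters ran, every blocked thread would repeat the expensive prepare in turn, which is what the stack traces appear to show.

    // Hypothetical illustration only; not Derby's actual code. All names invented.
    class CachedStatement {
        private boolean valid = false;   // guarded by the monitor on 'this'
        private Object plan;

        Object getPlan() {
            synchronized (this) {
                // A thread that queued up here behind a long-running prepare must
                // notice that another thread already produced a valid plan and skip
                // its own prepare. Without this re-check (or if 'valid' keeps being
                // flipped back to false), each waiter redoes the multi-hour
                // optimization one after another.
                if (!valid) {
                    plan = prepareFromScratch();   // stand-in for parse/bind/optimize/generate
                    valid = true;
                }
                return plan;
            }
        }

        private Object prepareFromScratch() {
            return new Object();   // placeholder for the expensive work
        }
    }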

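P.P.S. For reference, the part of our periodic capture that records the transaction and lock state boils down to selecting from Derby's diagnostic tables. Below is a simplified, hand-written sketch of that step, not the actual production code: the JDBC URL and credentials are placeholders, and I am using the standard diagnostic table names SYSCS_DIAG.TRANSACTION_TABLE and SYSCS_DIAG.LOCK_TABLE. (The derby.jdbc.xaTransactionTimeout=1800 setting discussed above is configured separately as a Derby property and is not shown here.)

    // Simplified sketch of the diagnostic snapshot; URL and credentials are placeholders.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class DerbyDiagSnapshot {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:derby://localhost:1527/proddb";   // placeholder URL
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement()) {
                dump(stmt, "SELECT * FROM SYSCS_DIAG.TRANSACTION_TABLE");
                dump(stmt, "SELECT * FROM SYSCS_DIAG.LOCK_TABLE");
            }
        }

        // Print every row of the result set with column names, one row per line.
        private static void dump(Statement stmt, String sql) throws Exception {
            System.out.println("=== " + sql + " ===");
            try (ResultSet rs = stmt.executeQuery(sql)) {
                ResultSetMetaData md = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        if (i > 1) row.append(" | ");
                        row.append(md.getColumnName(i)).append('=').append(rs.getString(i));
                    }
                    System.out.println(row);
                }
            }
        }
    }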