Randy Letness wrote:
Mike Matrigali wrote:
Posting complete information from derby.log may help. Just post all
of it from the most recent boot message to the end of the error.
I could, but I didn't want to spam the list with 100K of text.
That's fine. As someone else suggested, go ahead and file a
JIRA. That gives a great place to upload a 100K file so someone else
can look at it.
My best guess, with the limited information so far, is that you are
encountering an index split deadlock. So in the case below, transaction
6154 is the writer, 6159 is the reader, and 6160 is the internal
transaction trying to do an index split on behalf of 6154. The (10,1)
row lock is a special internal lock for Derby which allows better
performance for scanning but reduced concurrency for splitting. It
basically protects the scan "position" on a page while the scan does
not have a latch on the page. I'm not sure exactly, but from your program
description, 6154 has row locks on every row in the table and is likely
in the insert phase and needs to split the btree index, so it forks a
child transaction, 6160, to do the split, and that child is waiting on
6159, the reader. The reader in turn is blocked on row (2,110), which is
one of the many row locks held by the writer.
Ah, this makes sense. I didn't take internal transactions into account.
An index split deadlock, huh? I would never have thought of that. Does
this happen often?
I almost never see it, but of course every app is different. The
problem is that your test case, I think, pretty much guarantees it
happens, since you lock every row in the delete and the scanner is
stuck waiting on one of those rows.
What it takes is the writer adding rows to the same leaf index page
the scanner is blocked on, and then needing to split that page.
In the 3-million-row app, is it likely the scanner looks at the same
rows as the writer?
Altering the page size of the index may give you a partial workaround.
Bigger index pages would mean fewer total splits, and thus less chance
of hitting this deadlock case. But that is not guaranteed.
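If you want to experiment with that, the shape is roughly the following
(just an untested sketch: the index and column names are placeholders
since I don't know your schema, and the page size only applies to
conglomerates created after the property is set, so the index has to be
dropped and recreated):

  -- use 32K pages for indexes/tables created from here on
  CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.storage.pageSize', '32768');
  -- rebuild the index so it picks up the larger page size (placeholder names)
  DROP INDEX ix_ctri_itemid;
  CREATE INDEX ix_ctri_itemid ON cal_timerange_index (itemid);
  -- set the property back to the 4K default for anything created later
  CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY('derby.storage.pageSize', '4096');

If the database-wide property doesn't take effect for you,
derby.storage.pageSize can also be set in derby.properties before the
index is created.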
As another workaround, since your application knows it is going to
delete all the rows in the table, it may work to just use table-level
locking in the transaction that does the delete of all the rows.
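Something like this at the start of that transaction (a minimal sketch
for your test case, where the delete hits every row; run it with
autocommit off):

  -- take one exclusive table lock up front instead of ~300 row locks
  LOCK TABLE cal_timerange_index IN EXCLUSIVE MODE;
  DELETE FROM cal_timerange_index;
  -- re-insert the replacement rows, then commit; the table lock is
  -- held until the transaction commits or rolls back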
The problem is that in my test program it turns out to be all the rows
in the table, but in the environment where the problem was discovered
it will only be 300 out of 3 million rows, so locking the table would
reduce concurrency.
OK, then the workaround makes no sense there. In the real environment,
is the scanner likely to be looking at the same rows that you are
deleting?
The best way to track down what is going on with the thread that you
don't think should be active is to reproduce the problem with some
extra information being logged. That way it should be obvious whether
the extra transaction has committed or not, and whether it is an
internal transaction. The derby.log may also have stack traces which
would give me a better idea of what is going on.
Set derby.locks.monitor=true and derby.locks.deadlockTrace=true in
derby.properties. This will print the lock table information to
derby.log if a timeout happens. If you also set
derby.language.logStatementText=true, then all the statements executed
on the system will be written out to derby.log. The statement
information also includes the transaction id the statement is being
executed on. You can take the transaction id holding the lock in the
lock table dump and do a reverse search for it in derby.log, which
makes it possible to narrow down which statements may be holding the
locks.
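For reference, that is just these lines in derby.properties (the file
Derby reads from the derby.system.home directory):

  # print the lock table to derby.log when a lock timeout or deadlock happens
  derby.locks.monitor=true
  derby.locks.deadlockTrace=true
  # write every statement, along with its transaction id, to derby.log
  derby.language.logStatementText=true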
I already have derby.locks.monitor=true and derby.locks.deadlockTrace=true
set, and I get the lock table dump. I'll try setting
derby.language.logStatementText=true and see what extra info I can get.
Thanks for all the help.
-Randy
Randy Letness wrote:
Mike Matrigali wrote:
Also, just posting the deadlock message with the lock information
about the locks in the deadlock cycle will answer a lot of questions.
It may depend on whether there are indexes, on how many rows are in
the table, and on whether lock escalation happened. Also make sure to
include which version of the software you are running.
I'm running Derby 10.2.1.6.
The message is:
2007-01-31 16:57:53.828 GMT Thread[reader0,5,main] (XID = 6159),
(SESSIONID = 231), (DATABASE = target/test-db), (DRDAID = null),
Failed Statement is:
select distinct i.* from item i, cal_timerange_index ctri1
where i.parentid=? and i.isactive=1 and ctri1.itemid=i.id
  and (ctri1.componenttype = ?
       and ((ctri1.startdate < case when ctri1.isfloating=1 then '20080218T115000' else ? end
             and ctri1.enddate > case when ctri1.isfloating=1 then '20080117T115000' else ? end)
            or (ctri1.startdate >= case when ctri1.isfloating=1 then '20080117T115000' else ? end
                and ctri1.startdate < case when ctri1.isfloating=1 then '20080218T115000' else ? end)))
So the select statement is a join of two tables. The statement
returns about 30 rows. There are roughly 300 rows in
cal_timerange_index. The update transaction removes all of these rows
and adds the same number back, while the read transaction selects 30
of the rows that are being deleted by the update transaction.
The lock dump shows:

*** The following row is the victim ***
6159 |ROW   |S  |0   |(2,110)   |WAIT |T |NULL               |CAL_TIMERANGE_INDEX |

Here are the locks involved:

6154 |ROW   |X  |1   |(2,110)   |GRANT|T |NULL               |CAL_TIMERANGE_INDEX |
6159 |ROW   |S  |1   |(10,1)    |GRANT|T |SQL070131105108180 |CAL_TIMERANGE_INDEX |
6160 |ROW   |X  |0   |(10,1)    |WAIT |T |SQL070131105108180 |CAL_TIMERANGE_INDEX |

There are also these table locks:

6154 |TABLE |IX |315 |Tablelock |GRANT|T |NULL               |CAL_TIMERANGE_INDEX |
6159 |TABLE |IS |2   |Tablelock |GRANT|T |NULL               |CAL_TIMERANGE_INDEX |
6160 |TABLE |IX |1   |Tablelock |GRANT|T |NULL               |CAL_TIMERANGE_INDEX |
A couple of things to note. There are 3 transactions, yet only 2 should
be active at a time (1 writer thread, 1 reader thread). Also, none of
these transactions is in a traditional deadlock (where they are
waiting on each other). Instead, the deadlock occurs because
transaction 6154 doesn't release its locks. 6154 isn't waiting on
anything; it's almost as if the locks aren't getting released. I think
I'm going to try to work on a simpler example to reproduce this.
-Randy