On 11/10/08 01:27, Daniel Noll wrote:
Kristian Waagan wrote:
I feel we have too little information to create a fix - we don't even
know what the real problem is.
The locator values are drawn from a counter, and there is a counter
for each (root) connection. I'm having trouble understanding how we
could get concurrency issues in this case.
Also, I think the error you are seeing suggests an invalid locator
value, not a duplicate value.
Anything special about your network server setup? (time-slicing,
statement caching, connection pooling)
My suggestion is to wait for a while and see if it happens again, or
see if anyone else has suggestions.
It has happened again. This time it took 12 hours for it to happen,
which is information I didn't previously have. If I'm lucky this will
help reproduce it here. Maybe it's something that takes a long time
until it occurs. Or maybe it's something where the probability is just
really low so it takes an enormous number of attempts before it happens.
As far as the network server setup itself, it's straightforward. We're
not using connection pooling due to bugs preventing that from working
properly, and everything else is normal as well.
Are the problems you are having with connection pooling logged in Jira?
I guess I can run a test overnight to see if something similar happens,
with tracing turned on. It's going to generate a lot of output though
so I somewhat fear for my disk space. :-)
You can also run the test without logging to see if it can be reproduced
by a 12 hour run. If so, I think we have two initial options:
a) Synchronize the access to the counter properly
b) Add custom logging to the code that fails, to see which value
causes the failure. If it is one of the invalid locator values, it's a
strong indication that the problem is indeed the counter.
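
To illustrate option a), here is a minimal sketch of a counter that
hands out values atomically. The class and method names
(LocatorCounter, nextLocator, distinctValues) are hypothetical
stand-ins and not the actual Derby code; the sketch only shows the
general technique, assuming the counter is a plain integer that is
incremented on each request.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a per-connection locator counter (not Derby's class).
public class LocatorCounter {
    // AtomicInteger makes read-increment-write a single atomic step.
    private final AtomicInteger next = new AtomicInteger(1);

    // Returns the next locator value; safe to call from several threads.
    public int nextLocator() {
        return next.getAndIncrement();
    }

    // Hammers one counter from several threads and returns the number of
    // distinct values handed out. With a safe counter this equals
    // threads * perThread; a racy counter would yield duplicates and
    // therefore a smaller number.
    static int distinctValues(int threads, int perThread) {
        LocatorCounter counter = new LocatorCounter();
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    seen.add(counter.nextLocator());
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctValues(4, 10000));
    }
}
```

A synchronized method around a plain int would work just as well; the
atomic class is used here only because it keeps the sketch short.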
The bug I'm thinking of is one with a low probability, so if it happens
consistently after ~12 hours it sounds more like an overflow problem of
some kind.
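
As a rough sanity check on the overflow theory, the sketch below shows
what a wrap-around looks like and how fast values would have to be
allocated to exhaust a counter in a given time. It assumes, purely for
illustration, that the counter is a signed 32-bit int starting near
zero; the names (OverflowCheck, unsafeIncrement,
ratePerSecondToOverflow) are hypothetical and not taken from Derby.

```java
// Illustration only: assumes a signed 32-bit counter, which may not match Derby.
public class OverflowCheck {
    // Plain int addition silently wraps around on overflow.
    static int unsafeIncrement(int value) {
        return value + 1;
    }

    // Math.addExact throws ArithmeticException instead of wrapping,
    // which turns a silent overflow into a visible failure.
    static int checkedIncrement(int value) {
        return Math.addExact(value, 1);
    }

    // Allocations per second needed to use up all 2^31 non-negative
    // int values within the given number of hours (rounded down).
    static long ratePerSecondToOverflow(int hours) {
        long values = (long) Integer.MAX_VALUE + 1; // 2^31
        return values / (hours * 3600L);
    }

    public static void main(String[] args) {
        System.out.println(unsafeIncrement(Integer.MAX_VALUE));
        System.out.println(ratePerSecondToOverflow(12));
    }
}
```

Under that assumption, a 12 hour run would need close to 50,000
allocations per second to wrap the counter, so knowing the actual load
would help confirm or rule out this theory.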
If you can give me some more details about the data and the load, I
might be able to kick off some test runs of my own:
- Blob size
- number of rows in the table
- number of clients accessing the table concurrently
- isolation level
- page cache size
- any other information you think might be relevant
--
Kristian
Daniel