On 11/10/08 01:27, Daniel Noll wrote:
Kristian Waagan wrote:
I feel we have too little information to create a fix - we don't even know what the real problem is. The locator values are drawn from a counter, and there is a counter for each (root) connection. I'm having trouble understanding how we could get concurrency issues in this case. Also, I think the error you are seeing suggests an invalid locator value, not a duplicate value.

Anything special about your network server setup? (time-slicing, statement caching, connection pooling)

My suggestion is to wait for a while and see if it happens again, or see if anyone else has suggestions.

It has happened again. This time it took 12 hours for it to happen, which is information I didn't previously have. If I'm lucky this will help me reproduce it here. Maybe it's something that takes a long time to occur. Or maybe it's something where the probability is just really low, so it takes an enormous number of attempts before it happens.

As far as the network server setup itself goes, it's straightforward. We're not using connection pooling due to bugs preventing that from working properly, and everything else is normal as well.

Are the problems you are having with connection pooling logged in Jira?


I guess I can run a test overnight to see if something similar happens, with tracing turned on. It's going to generate a lot of output though so I somewhat fear for my disk space. :-)

You can also run the test without logging to see if it can be reproduced by a 12 hour run. If so, I think we have two initial options:
 a) Synchronize the access to the counter properly
 b) Add custom logging to the code that fails, to see which value causes the failure. If it is one of the invalid locator values, it's a strong indication that the problem is indeed the counter.
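
To make option a) concrete, here is a minimal sketch of a counter that is safe to share between threads. The class and method names are hypothetical and are not taken from the Derby source; the sketch only illustrates the technique (an atomic increment) and checks that concurrent callers never receive the same value.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a per-connection locator counter (not Derby's class).
final class LocatorCounter {
    // Start at 1 on the assumption that non-positive values are reserved as invalid.
    private final AtomicInteger next = new AtomicInteger(1);

    // Atomic read-and-increment: no two callers can observe the same value.
    int nextLocator() {
        return next.getAndIncrement();
    }

    // Draws locators from several threads and returns how many distinct values were seen.
    static int distinctLocators(int threads, int perThread) {
        LocatorCounter counter = new LocatorCounter();
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    seen.add(counter.nextLocator());
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return seen.size();
    }
}
```

If every value is unique, distinctLocators(threads, perThread) returns threads * perThread. A plain `int` field incremented with `++` and no synchronization can fall short of that under contention.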

The bug I'm thinking of is one with a low probability, so if it happens consistently after ~12 hours it sounds more like an overflow problem of some kind.
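
As a rough sanity check on the overflow idea, the sketch below assumes the counter is a 32-bit Java `int` starting near zero (an assumption, not something confirmed from the code). It shows what the value becomes once it passes the maximum, and how long a given allocation rate would take to get there.

```java
// Hypothetical helper for back-of-the-envelope overflow estimates.
final class OverflowEstimate {
    // Incrementing an int past Integer.MAX_VALUE wraps around to a negative value.
    static int valueAfterMax() {
        int counter = Integer.MAX_VALUE;
        counter++;
        return counter;
    }

    // Hours until an int counter starting near zero wraps, at a steady allocation rate.
    static double hoursUntilWrap(double allocationsPerSecond) {
        return Integer.MAX_VALUE / allocationsPerSecond / 3600.0;
    }
}
```

Under that assumption, a wrap after about 12 hours would need on the order of 50,000 locator allocations per second, so knowing the actual load would help confirm or rule this out.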


If you can give me some more details about the data and the load, I might be able to kick off some test runs of my own:
 - Blob size
 - number of rows in the table
 - number of clients accessing the table concurrently
 - isolation level
 - page cache size
 - any other information you think might be relevant


--
Kristian


Daniel


