I would like to see some solution to DERBY-700 go into the 10.3 release,
as it is currently too easy for users to access the same database from
2 different classloaders in the same JVM and corrupt it.
Let me try to explain the problem again. Derby protects multi-threaded
access to databases using in-memory locking. Any situation where 2
instances of Derby can access the same on disk database but not
coordinate through the same in-memory locking scheme creates a situation
that can lead to log/data corruption.
Derby coordinates access to an on-disk database using a two-level,
per-database locking scheme. The following is high level and probably
misses some detail.
1) First it obtains what I refer to as an OS file lock. This is not a
   Java-defined mechanism. It depends on OS-specific behavior: on some
   OSes it is not possible to delete a file while it is still open (the
   most common OSes where this is true are, I believe, all the Windows
   OSes: 98, NT, 2000, XP, ...).
   A) If the lock file does not exist, we open it and leave it open -->
      lock granted.
   B) If the lock file does exist, we try to delete it. If we can, we
      go back to A. If we can't, we assume the file is locked and
      give up.
2) The second level uses Java-based file locking. This only became
   available in 1.4.2 and later JVMs. It does standard locking and
   automatically takes care of releasing the lock if the JVM goes
   away.
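
The two levels above can be sketched roughly as follows. This is only
an illustration, not Derby's actual code: the class name TwoLevelLock
and the method tryAcquire are hypothetical, and step 2 is assumed to be
implemented with java.nio.channels.FileLock.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

/** Hypothetical sketch of the two-level per-db lock; not Derby's code. */
final class TwoLevelLock implements AutoCloseable {
    private final RandomAccessFile openFile; // kept open: step 1
    private final FileLock nioLock;          // java.nio lock: step 2

    private TwoLevelLock(RandomAccessFile openFile, FileLock nioLock) {
        this.openFile = openFile;
        this.nioLock = nioLock;
    }

    /** Returns a held lock, or null if the db appears to be in use. */
    static TwoLevelLock tryAcquire(File lockFile) {
        try {
            // Step 1B: a lock file that cannot be deleted is assumed to
            // be held open by another instance. This only helps on OSes
            // that refuse to delete an open file.
            if (lockFile.exists() && !lockFile.delete()) {
                return null;
            }
            // Step 1A: create the file and leave it open.
            RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
            // Step 2: ask for an exclusive lock on the whole file.
            FileLock lock;
            try {
                lock = raf.getChannel().tryLock();
            } catch (OverlappingFileLockException e) {
                lock = null; // already locked from within this JVM
            }
            if (lock == null) {
                raf.close();
                return null;
            }
            return new TwoLevelLock(raf, lock);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the java.nio lock and closes the file held open for step 1. */
    @Override
    public void close() {
        try {
            nioLock.release();
            openFile.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the FileLock documentation states that file locks are held on
behalf of the entire Java virtual machine, so step 2 in this sketch is
not designed to arbitrate between two holders inside one JVM.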
We kept both steps, even though it seems that step 2 alone is
sufficient, in order to provide backward-compatible protection: anyone
running on a JVM prior to 1.4.2 on a Windows platform is still
protected. The pre-Derby code base also had to worry about protecting
access against versions of the code that did not have step 2 implemented.
The problem is that when Derby is run in 2 class loaders in the same
JVM, the step 2 file locking does not work. It is meant only to
protect against access from different JVMs, so it provides no help in
preventing access from 2 different class loaders in the same JVM. It
turns out that on Windows systems step 1 also solves the locking
problem for multiple class loaders.
So a solution should provide 2 things:
1) prevent access from another classloader in the same JVM
2) not allow false positives. For instance, a standard "lock file"
could be used on Unix systems: create it at boot, and on any later boot
check for its existence and give up if it exists. The problem is that
it is very easy to cause a JVM to exit without properly cleaning up
the lock file, and thus one would get into a situation where the user
may have to clean up lock files by hand.
Here is my current proposal, but I really don't like it -- I am hoping
someone out there can come up with something better.
o keep the step 1 and step 2 locking as described above. It solves
cross JVM locking completely.
o create a step 3 locking step, the only purpose is to recognize
situations step 1 and 2 can't.
o use simple file system lock file to implement step 3 locking:
   A) On db boot, if no lock file exists, create it and put a timestamp
      in the file. This db boot is responsible for updating that
      timestamp every N seconds. For this we need a background thread
      that is guaranteed to execute; there are known problems with the
      current background thread (long checkpoints, for instance). On
      shutdown of the db we delete the lock file.
   B) On db boot, if the lock file exists, we open it, get the
      timestamp, and compare it with the current time. If the
      difference is greater than N we assume the lock file has been
      left around incorrectly, and we delete the file and go to A (or
      just open it and update it with our current timestamp). It is
      probably worth logging this event in derby.log.
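
To make cases A and B concrete, here is a minimal sketch of the step 3
lock file. Everything in it is an assumption for illustration: the
class name HeartbeatLockFile, the choice of storing the timestamp as
decimal text, and passing the current time in as a parameter (so the
logic can be exercised without waiting). N is given in milliseconds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of the step 3 timestamp lock file. */
final class HeartbeatLockFile {
    private final Path file;
    private final long staleAfterMillis; // the "N" from the proposal

    HeartbeatLockFile(Path file, long staleAfterMillis) {
        this.file = file;
        this.staleAfterMillis = staleAfterMillis;
    }

    /** Cases A and B at db boot: true if we now own the lock file. */
    boolean tryAcquire(long nowMillis) {
        try {
            if (Files.exists(file)) {
                long last;
                try {
                    last = Long.parseLong(new String(
                        Files.readAllBytes(file), StandardCharsets.UTF_8).trim());
                } catch (NumberFormatException e) {
                    return false; // unreadable: conservatively assume held
                }
                if (nowMillis - last <= staleAfterMillis) {
                    return false; // B: the owner refreshed it recently
                }
                Files.delete(file); // B: stale, left around incorrectly
            }
            // A: create the file with our timestamp. CREATE_NEW fails if
            // another booter created it between the check and here.
            Files.write(file, text(nowMillis),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called by the owner every N seconds or less. */
    void heartbeat(long nowMillis) {
        try {
            Files.write(file, text(nowMillis));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Called on db shutdown. */
    void release() {
        try {
            Files.deleteIfExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] text(long millis) {
        return Long.toString(millis).getBytes(StandardCharsets.UTF_8);
    }
}
```

The sketch leaves out the hard part named above, which is getting
heartbeat() called reliably every N seconds, and it does not close the
window between deleting a stale file and a competing booter doing the
same.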
One nice thing about the above solution is that I think it also solves
our problem with multiple machines accessing the same disk (as long as
their clocks are the same or close). I think we can pick a large N,
as this should be an error case (i.e. the purpose of N is to catch the
case where a classloader went away without allowing us to clean up; I
don't know enough about classloaders to know how likely this is), but
it is probably worth making it configurable so we could adjust it in
the field if necessary.
I really don't like forcing Derby to run a job every N seconds. It
could be hard to explain to users why derby is doing work every N
seconds even when nothing else is happening. I worked on a different
product a long time ago that required us to maintain our own timer for
use in scheduling waits, and users noticed and complained, such that we
had to add a backoff mechanism based on the amount of work being done
just so the process would not show up at a steady 1% (or whatever) on
their process status monitors. Now, that timer's interval was a lot
shorter than seconds, so it may not be an issue here.
Some extensions I considered:
1) come up with a unique ID that is specific to a JVM and can be queried
by any thread in any classloader in the JVM. If we had that then we
could write that value into the step 3 lock file and we know that if we
opened the file and saw a different ID, the lock file was invalid. If
we did this then we narrow the chance of a false positive but we
eliminate the chance to catch cross machine access.
2) Have the opener db log some unique id specific to it in addition to
the timestamp, maybe this is just the first timestamp of the open. Then
it could at least tell the next time if another class loader had
incorrectly started, and throw/log some error so we could catch this
problem.
3) maybe the value of N should be logged in the file, I think it has to
be if we allow it to be configurable.
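
As a rough sketch of how extensions 1 and 3 might fit together, the
code below keeps a per-JVM ID in a system property (system properties
are shared by all class loaders in a JVM) and stores it in the lock
file record along with the timestamps and N. All names here are
hypothetical, including the property key "example.derby.jvmId" and the
comma-separated record layout. It also assumes nothing else in the JVM
overwrites that property and that steps 1 and 2 have already ruled out
a live owner in another JVM.

```java
import java.util.UUID;

/** Hypothetical lock file record: JVM id, boot time, heartbeat, N. */
final class LockRecord {
    static final String JVM_ID_KEY = "example.derby.jvmId"; // assumed key

    final String jvmId;
    final long bootMillis;       // extension 2: first timestamp of the open
    final long heartbeatMillis;  // last refresh by the owner
    final long staleAfterMillis; // extension 3: N as used by the writer

    LockRecord(String jvmId, long bootMillis,
               long heartbeatMillis, long staleAfterMillis) {
        this.jvmId = jvmId;
        this.bootMillis = bootMillis;
        this.heartbeatMillis = heartbeatMillis;
        this.staleAfterMillis = staleAfterMillis;
    }

    /** Extension 1: an id every class loader in this JVM agrees on. */
    static String jvmId() {
        String candidate = UUID.randomUUID().toString();
        // putIfAbsent is atomic, so the first caller's value wins.
        Object existing =
            System.getProperties().putIfAbsent(JVM_ID_KEY, candidate);
        return existing == null ? candidate : existing.toString();
    }

    String format() {
        return jvmId + "," + bootMillis + ","
            + heartbeatMillis + "," + staleAfterMillis;
    }

    static LockRecord parse(String line) {
        String[] f = line.trim().split(",");
        if (f.length != 4) {
            throw new IllegalArgumentException("bad lock record: " + line);
        }
        return new LockRecord(f[0], Long.parseLong(f[1]),
            Long.parseLong(f[2]), Long.parseLong(f[3]));
    }

    /** True if this existing record should stop a new boot. */
    boolean blocksBoot(String myJvmId, long nowMillis) {
        if (!jvmId.equals(myJvmId)) {
            return false; // written by some other JVM: treat as invalid
        }
        // Same JVM: judge staleness with the N the writer recorded.
        return nowMillis - heartbeatMillis <= staleAfterMillis;
    }
}
```

As noted for extension 1, treating a record from a different JVM as
invalid gives up any chance of catching cross-machine access.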