My best advice for now has been to explicitly synchronize on the
repository instance whenever you are doing versioning operations. Note
that you can still do normal read and write operations concurrently
with versioning, so this isn't as bad as it could be. Perhaps we
should put that synchronization inside the versioning methods until
the concurrency issues are solved...

The problem here is that "versioning operations" covers quite a lot.
For us the really nasty one is cloning nodes between workspaces, as
we've used a content model that maps releases to workspaces. Publishing
a release therefore involves cloning an entire workspace (which takes a
few tens of minutes). During this period no other write operations can
take place. Putting synchronisation code inside the versioning methods
would mean that the entire application locks up during this period,
while having it outside in our own app means that we can be a bit more
flexible with how we handle locking (e.g. use locks that time out with
an error rather than allowing the application to be completely locked
for 30-60 mins at a time).
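
To make the "locks that time out" idea concrete, here is a rough sketch
of the kind of application-level guard I mean. The class name
VersioningGuard and its run() method are made up for illustration and
are not part of Jackrabbit or of our actual code; the only real API used
is java.util.concurrent's ReentrantLock. The Callable stands in for
whatever versioning or clone call needs to be serialised.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical helper (not a Jackrabbit class): serialises versioning-style
 * operations, but fails fast instead of blocking callers indefinitely.
 */
class VersioningGuard {
    private final ReentrantLock lock = new ReentrantLock();
    private final long timeoutMillis;

    VersioningGuard(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Runs the operation under the lock, or throws if the lock stays busy. */
    <T> T run(Callable<T> operation) throws Exception {
        // Wait at most timeoutMillis for the lock rather than blocking forever.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException(
                "Versioning lock still busy after " + timeoutMillis + " ms");
        }
        try {
            return operation.call();
        } finally {
            // Always release, even if the operation throws.
            lock.unlock();
        }
    }
}
```

A caller that hits the timeout gets an exception it can turn into a
"publishing in progress, try again later" message, which a plain
synchronized block on the repository cannot offer.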

There are a few areas of the code that cause this sort of problem -
the other big one is indexing. In order to support a home-brewed
failover mechanism for active-passive clustering we need to delete
search indexes on failover (as they are likely to be corrupt in the
event of failover). On subsequent startup the application needs to
reindex each workspace independently when it is first accessed. This
takes a few minutes to do, again locking users out while this takes
place.

I don't think there is a "quick fix" other than to go in and spend
some time fixing the existing scenarios where deadlock can occur and
doing some hardcore testing of concurrency issues.

Miro
