My best advice for now is to explicitly synchronize on the repository instance whenever you are doing versioning operations. Note that you can still do normal read and write operations concurrently with versioning, so this isn't as bad as it could be. Perhaps we should put that synchronization inside the versioning methods until the concurrency issues are solved...
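For illustration, a minimal sketch of that advice, assuming a hypothetical VersioningHelper class in the application (Node.checkin() itself is the standard JCR call):

    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.RepositoryException;

    // Hypothetical helper: serialize versioning operations by
    // synchronizing on the Repository instance, as suggested above.
    public class VersioningHelper {

        // Check in a versionable node while holding the repository-wide lock.
        public static void checkin(Repository repository, Node node)
                throws RepositoryException {
            synchronized (repository) {
                // Versioning op runs alone; plain reads and writes
                // elsewhere are unaffected.
                node.checkin();
            }
        }
    }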
The problem here is that "versioning operations" covers quite a lot. For us the real nasty is cloning nodes between workspaces, since we've used a content model that maps releases to workspaces. Publishing a release therefore involves cloning an entire workspace, which takes a few tens of minutes, and during that period no other write operations can take place. Putting synchronisation code inside the versioning methods would mean that the entire application locks up for the duration, while keeping it in our own application code lets us be more flexible about how we handle locking (e.g. use locks that time out with an error rather than leaving the application completely locked for 30-60 minutes at a time).

There are a few other areas of the code that cause this sort of problem; the other big one is indexing. To support a home-brewed failover mechanism for active-passive clustering, we need to delete the search indexes on failover (as they are likely to be corrupt at that point). On the subsequent startup the application has to reindex each workspace independently when it is first accessed. This takes a few minutes per workspace, again locking users out while it happens.

I don't think there is a "quick fix" other than to go in and spend some time fixing the existing scenarios where deadlock can occur, and to do some hardcore testing of the concurrency issues.

Miro
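A sketch of the timeout-based locking described above might look like the following; the ReleasePublisher class, the 30-second timeout, and the /content paths are illustrative assumptions, while Workspace.clone() is the standard JCR call:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    import javax.jcr.RepositoryException;
    import javax.jcr.Workspace;

    // Hypothetical publisher: guard the long-running workspace clone with
    // a lock that fails fast instead of blocking every caller for the
    // full 30-60 minute clone.
    public class ReleasePublisher {

        private final ReentrantLock publishLock = new ReentrantLock();

        public void publishRelease(Workspace target, String sourceWorkspace)
                throws RepositoryException, InterruptedException {
            // Give up with an error after 30 seconds rather than queueing
            // behind a clone that may run for the better part of an hour.
            if (!publishLock.tryLock(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException(
                        "A release is already being published; try again later");
            }
            try {
                // Clone the release subtree from the source workspace
                // (paths are illustrative).
                target.clone(sourceWorkspace, "/content", "/content", true);
            } finally {
                publishLock.unlock();
            }
        }
    }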
