Hi Bryan,
Bryan Davis wrote:
It is also tempting to try to push this functionality into the database,
since it is already doing all of the required locking anyway. A custom
transaction manager that delegated to the repository session transaction
manager (thereby associating JDBC connections with the repository session),
in conjunction with a stock data source implementation (see below), might do
the trick. Of course this would only work with database PMs, but perhaps
other TMs could still have the existing SISM locking enabled. This would
be good enough for us, since we only use database PMs, and a better, more
universal solution could be implemented at a later date.
Has anyone looked into this issue at all or have any advice / thoughts?
well, it's not that easy. locking in SISM is required because there is a cache
of item states. the same requirements that apply to a persistence manager also
apply to this cache; specifically, after the change log has been persisted, the
cache must be updated atomically.
there were some recent thoughts about implementing a multi-version cache
(similar to MVCC in databases), which would support better concurrency. but that
requires more effort than just removing some synchronized modifiers in the
source code.
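just to sketch the idea (purely illustrative, none of these names exist in
jackrabbit): readers always work against an immutable snapshot of the cache,
and a writer publishes the next version in one atomic step after the change
log has been persisted.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

class MultiVersionItemCache<K, V> {
    // readers never block: they see whatever snapshot is current
    private final AtomicReference<Map<K, V>> current =
            new AtomicReference<Map<K, V>>(new HashMap<K, V>());

    V get(K key) {
        return current.get().get(key);
    }

    // a writer copies the current snapshot, applies the persisted
    // change log, and publishes the new version atomically
    void applyChanges(Map<K, V> changes) {
        Map<K, V> before, after;
        do {
            before = current.get();
            after = new HashMap<K, V>(before);
            after.putAll(changes);
        } while (!current.compareAndSet(before, after));
    }
}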
Issue #2: Multiple issues in database persistence managers. I believe the
database persistence managers have multiple issues (please correct me if I
get any of this wrong).
1. JDBC connection details should not be in the repository.xml. I should be
free to change the specifics of a particular database connection without it
constituting a repository initialization parameter change (which is
effectively what changing the repository.xml is, since it gets copied and
subsequently accessed from inside the repository itself). If a host name or
driver class or even connection URL changes, I should not have to manually
edit internal repository configuration files to effect the change.
there's an
org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager, which
does exactly what you need.
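a configuration sketch (the parameter names here are from memory, so please
check the bean properties of that class before relying on them):

<PersistenceManager class="org.apache.jackrabbit.core.persistence.db.JNDIDatabasePersistenceManager">
  <!-- the DataSource is resolved from JNDI at runtime, so driver, URL
       and credentials live in the container, not in repository.xml -->
  <param name="dataSourceLocation" value="java:comp/env/jdbc/jackrabbit"/>
  <param name="schema" value="postgresql"/>
  <param name="schemaObjectPrefix" value="ws_"/>
</PersistenceManager>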
2. Sharing JDBC connections (and objects obtained from them, like prepared
statements) between multiple threads is not a good practice. Even though
many drivers support such activity, it is not specifically required by the
spec, and many drivers do not support it. Even for the ones that do, there is
always a significant list of caveats (like changing the transaction
isolation of a connection impacting all threads, or rollbacks sometimes
being executed against the wrong thread). Plus, as far as I can tell, there
is no particularly good reason to attempt this here. Connection
pooling is extremely well understood and there are a variety of
implementations to choose from (including Apache's own in Jakarta Commons).
so far we haven't seen any issues with our approach of sharing a connection
among threads. we are always happy to accept contributions in that area ;)
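for reference, the checkout/return pattern bryan describes would look roughly
like this with jakarta commons DBCP (driver, URL and credentials below are
placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.commons.dbcp.BasicDataSource;

public class PooledAccessSketch {
    public static void main(String[] args) throws SQLException {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("org.postgresql.Driver");      // placeholder
        ds.setUrl("jdbc:postgresql://localhost/jackrabbit"); // placeholder
        ds.setUsername("jcr");
        ds.setPassword("secret");
        ds.setMaxActive(10); // upper bound on concurrent connections

        // each thread checks a connection out, uses it and returns it;
        // connections and statements are never shared between threads
        Connection con = ds.getConnection();
        try {
            PreparedStatement stmt = con.prepareStatement("SELECT 1");
            try {
                ResultSet rs = stmt.executeQuery();
                rs.close();
            } finally {
                stmt.close();
            }
        } finally {
            con.close(); // returns the connection to the pool
        }
    }
}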
3. Synchronization is bad (generally speaking of course :).
one could also say (generally speaking) that 'no synchronization' is bad. if you
have a multi-threaded program that works without any synchronization and is
super-fast, but just doesn't behave correctly, what good is it?
Once the
multithreaded issues of JDBC are removed (via a connection pool), there are
no good reasons that I can see to have any synchronization in the database
persistence managers.
that's correct but won't solve your issue, because of JCR-314.
btw. you can easily implement your own persistence manager based on the
JNDIDatabasePersistenceManager that provides concurrent reads. I think we
haven't implemented one so far because we didn't see much value in it.
Since any sort of requirement for synchronized
operation would be coming from a higher layer, it should also be provided at
a higher layer. I have always felt that a good rule of thumb in server code
is to avoid synchronization at all costs, particularly in core server
functionality (like reading and writing to a repository). It is extremely
difficult to fully understand the global implications of synchronized code,
particularly code that is synchronized at a low level. Any serialization of
user requests is extremely serious in a multithreaded server and, in my
experience, will lead to show-stopping performance and scalability issues in
nearly all cases. In addition, serialization of requests at such a low
level probably means that other synchronized code that is intended to be
properly multithreaded is not well tested, since the request
serialization has eliminated (or greatly reduced) the possibility of the
code being reentered like it normally would be.
please note that this serialization of writes currently only takes place at the
very last step, when changes are stored. all other operations, like constraint
and node type validation, access right checks, checking whether the save is
self-contained, etc., can be done by multiple threads concurrently.
The solution to all of these issues, perhaps, is to use the standard JDBC
DataSource interface to encapsulate the details of managing the JDBC
connections. If all of the current PM and FS implementations that use JDBC
were refactored to have a DataSource member and to get and release
connections inside of each method, then parity with the existing
implementations could be achieved by providing a default DataSourceLookup
strategy implementation that simply encapsulated the existing connection
creation code (ignoring connection release requests).
I can't follow you here; how would that help?
This would allow us
(and others) to externally extend the implementation with alternative
DataSourceLookup strategy implementations, say for accessing a datasource
from JNDI, or getting it from a Spring application context. This solution
also neatly externalizes all of the details of the actual datasource
configuration from the repository.xml.
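For concreteness, the strategy could look something like this (all names are
hypothetical, taken from this proposal rather than from anything that exists
in Jackrabbit today):

import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public interface DataSourceLookup {
    DataSource getDataSource() throws Exception;
}

// one pluggable strategy: resolve the DataSource from JNDI, keeping
// all connection details outside of repository.xml
class JndiDataSourceLookup implements DataSourceLookup {
    private final String location;

    JndiDataSourceLookup(String location) {
        this.location = location;
    }

    public DataSource getDataSource() throws NamingException {
        return (DataSource) new InitialContext().lookup(location);
    }
}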
using JNDI is already possible, and implementing a database persistence manager
that obtains its connection from a Spring context is fairly easy: you just have
to implement the method DatabasePersistenceManager.getConnection() accordingly.
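a minimal sketch of that, assuming the getConnection() signature below (please
verify it against your jackrabbit version) and a spring-managed DataSource
made reachable through a static field:

import java.sql.Connection;
import javax.sql.DataSource;
import org.apache.jackrabbit.core.persistence.db.DatabasePersistenceManager;
import org.springframework.context.ApplicationContext;

public class SpringDbPersistenceManager extends DatabasePersistenceManager {

    // hypothetical: set once by whatever code bootstraps the spring context
    public static ApplicationContext context;

    // assumed signature; only the method name is confirmed above
    protected Connection getConnection() throws Exception {
        // "jackrabbitDataSource" is a placeholder bean name
        DataSource ds = (DataSource) context.getBean("jackrabbitDataSource");
        return ds.getConnection();
    }
}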
regards
marcel