Miro Walker wrote:
> We've been discussing the DB PM implementation and have a couple of
> questions about it. At the moment, the Simple DB PM appears to have
> been implemented using a single connection, with all write operations
> synchronised on a single object. This would imply that all writes to
> the database are single-threaded, effectively making any application
> that uses it single-threaded for write operations as well. This
> appears to have two implications:
This is not quite true. The actual store operation on the persistence
manager is synchronized. However, most write calls from different
threads to the JCR API in Jackrabbit will not block each other, because
those changes are made in a private transient scope. Only the final
save or commit of the transaction is serialized, and that is just one
part of the whole write process.
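
To illustrate, here is a rough sketch against the JCR API (class and
node names are made up, this is not actual Jackrabbit code):

    import javax.jcr.Repository;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    // rough sketch: two sessions write "concurrently"; the transient
    // changes never block each other, only the save() calls are
    // serialized in the persistence manager
    public class TransientScopeDemo {
        public static void demo(Repository repository)
                throws RepositoryException {
            SimpleCredentials creds =
                    new SimpleCredentials("user", "pass".toCharArray());
            Session s1 = repository.login(creds);
            Session s2 = repository.login(creds);

            s1.getRootNode().addNode("fromSessionOne"); // transient in s1
            s2.getRootNode().addNode("fromSessionTwo"); // transient in s2

            s1.save(); // store in the persistence manager, serialized
            s2.save(); // waits only while another store is in progress
        }
    }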
> 1. Performance - in a multi-user system, having single-threaded writes
> to the database will make the JDBC connection a serious bottleneck as
> soon as the application comes under load. It also means that any
> background processing that needs to iterate over the repository making
> changes (and we have a few of those) will effectively bring all other
> users to a grinding halt.
This depends very much on the use case. Again, all changes that such a
background process makes are first made in a transient scope; other
sessions are affected, if at all, only when the changes are stored in
the persistence manager.

While one session stores its changes, other sessions are still able to
read certain items, as long as those are available in their
LocalItemStateManager. Only when other sessions access items that are
not available in their LocalItemStateManager will they be blocked until
the store is finished.
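
If blocking during a long store is a concern, a background process can
also save in smaller batches, so each serialized store phase stays
short. A hypothetical sketch (the property name and the batch size are
made up):

    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    // hypothetical sketch: a background job that saves in batches,
    // keeping each serialized store phase short for other sessions
    public class BackgroundJob {
        public static void touchAll(Session session, Node parent)
                throws RepositoryException {
            int count = 0;
            for (NodeIterator it = parent.getNodes(); it.hasNext();) {
                Node child = it.nextNode();
                child.setProperty("processed", true); // transient only
                if (++count % 100 == 0) {
                    session.save(); // only this competes with writers
                }
            }
            session.save(); // persist the remainder
        }
    }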
> 2. Transactions - we haven't tested this (as the recent support for
> transactions in versioning operations has not been integrated into
> our system), but it appears that if a single connection is being
> used, then we can only have a single transaction active at any one
> time. So, if each user tries to execute a transaction with multiple
> write operations in it, and these transactions are to be propagated
> through to the database, then each transaction must complete before
> the next can begin. This would mean either that we get exceptions
> when the system attempts to interleave operations from different
> transactions, or that each transaction must complete in full before
> another can begin, further compounding the performance issue.
The scopes of a JCR transaction and of the transaction on the
underlying database used by Jackrabbit are not the same. A JCR
transaction starts with the first modified item, whereas the
transaction on the underlying database starts with the call to
Item.save() or Session.save(), or with the JTA transaction commit
(whatever you prefer ;)). That basically means JCR transactions can
run in parallel most of the time; only the commit phase of a JCR
transaction is serialized.
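
For illustration, a rough sketch of the two scopes using Jackrabbit's
XASession (the Xid implementation is omitted, and this is not
production code):

    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;
    import org.apache.jackrabbit.core.XASession;

    // rough sketch: the JCR transaction is "running" from the first
    // change, but the database is only touched in the commit phase
    public class TwoScopesDemo {
        public static void demo(XASession session, Xid xid)
                throws Exception {
            XAResource xares = session.getXAResource();

            xares.start(xid, XAResource.TMNOFLAGS);
            session.getRootNode().addNode("inTransaction"); // transient
            session.save(); // enters the transaction scope, no db commit yet
            xares.end(xid, XAResource.TMSUCCESS);

            xares.prepare(xid);
            xares.commit(xid, false); // only here is the db committed
        }
    }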
> In addition to the implications of using a single synchronised
> connection, another issue appears to be that the system will be
> unable to recover from a connection failure. For example, if the
> system were deployed onto a highly available database cluster, then
> in the event of a DB instance failure, any open connections will be
> killed but can quite happily be reopened later. Jackrabbit appears to
> create a connection on initialisation and has no way to recover if
> that connection is killed.
This is certainly an issue with the SimpleDbPersistenceManager. I guess
that's why it is called Simple...
IMO the SimpleDbPersistenceManager is mainly intended for embedded
databases, where a connection failure is highly unlikely because there
is no network in between.
> I know that questions around implementing support for connection
> pooling on the DB have been raised before and then dismissed as
> unimportant, but this appears to me to be pretty fundamental. By
> using a connection pool implementation that supports recreating dead
> connections and tying a connection to a transaction context, multiple
> transactions could run in parallel, helping throughput and making the
> system more reliable.
Even if such a persistence manager allowed concurrent writes, it would
still be the responsibility of the caller to ensure consistency. In
our case that's the SharedItemStateManager, and that is where
transactions are currently serialized, but only on commit.

If concurrent write performance should become a real issue, that's
where we would first have to deal with it.
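
For what it's worth, recreating dead connections could look roughly
like this with Jakarta Commons DBCP (a hypothetical sketch; the table
and column names are made up, and it ignores the consistency aspect
described above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.commons.dbcp.BasicDataSource;

    // hypothetical sketch of the pooling idea: borrow a connection per
    // store instead of holding one for the lifetime of the persistence
    // manager; a connection killed by a db failover is simply replaced
    public class PooledStoreSketch {
        private final BasicDataSource ds = new BasicDataSource();

        public PooledStoreSketch() {
            ds.setDriverClassName("org.postgresql.Driver"); // examples
            ds.setUrl("jdbc:postgresql://dbhost/jackrabbit");
            ds.setUsername("jcr");
            ds.setPassword("secret");
            ds.setDefaultAutoCommit(false);
        }

        public void store(String id, byte[] data) throws SQLException {
            Connection con = ds.getConnection(); // fresh or recycled
            try {
                PreparedStatement stmt = con.prepareStatement(
                        "update NODE set DATA = ? where ID = ?");
                stmt.setBytes(1, data);
                stmt.setString(2, id);
                stmt.executeUpdate();
                con.commit();
                stmt.close();
            } finally {
                con.close(); // returns the connection to the pool
            }
        }
    }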
regards
marcel