Michael Neale
Sat, 10 Mar 2007 21:50:32 -0800
Stefan - I think one of the main issues is the low priority of JCR-314 - it is much more serious then "minor" - my understanding is that any long running action - eg an upload of a large piece of content (file) - will basically block any other concurrent actions. For more fine grained CMS uses - this is probably not a problem (as content is read-mostly a lot of the time) - BUT, for people wanting to store large blobs (something that people would look at using JCR for) - this is a showstopper. Many transaction monitors in app servers have timeouts of 30 seconds - on the web or other tiers (of course this can be adjusted - but its not exactly a user friendly solution !). (my understanding may be wrong - I sincerely hope it is). On 3/10/07, Stefan Guggisberg <[EMAIL PROTECTED]> wrote:
On 3/10/07, Bryan Davis <[EMAIL PROTECTED]> wrote: > Stefan, > > There are a couple of issues that, collectively, we need to address in order > to successfully use Jackrabbit. > > Issue #1: Serialization of all repository updates. See > https://issues.apache.org/jira/browse/JCR-314, which I think seriously > understates the significance of the issue. In any environment where users > are routinely writing anything at all to the repository (like audit or log > information), a large file upload (or a small file over a slow link) will > effectively block all other users until it completes. > > Having all other threads hang while a file is being uploaded is simply a > show stopper for us (unfortunately this issue is marked as minor, reported > in 0.9, and not currently slated for a particular release). Trying to solve > this issue outside of Jackrabbit is impossible, providing only stopgap > solutions; plus external mitigation strategies (like uploading to a local > file and then streaming into Jackrabbit as fast as possible) all seem fairly > complex to make robust, seeing as how some data management would have to be > handled outside of the repository transaction. Which leaves us with trying > to resolve the issue by patching Jackrabbit. > > I now understand that the Jackrabbit fixes are multifaceted, and that (at > least) they involve changes to both the Persistence Manager and Shared Item > State Manager. The Persistence Manager changes (which I will talk about > separately), I think, are easy enough. The SISM obviously needs to be > upgraded to have more granular locking semantics (possibly by using > item-level nested locks, or maybe a partial solution that depends on > database-level locking in the Persistence Manager). > > There are a number of lock manager implementations floating around that > could potentially be repurposed for use inside the SISM. I am uncertain of > the requirements for distribution here, although on the surface it seems > like a local locking implementation is all that is required since it seems > like clustering support is handled at a higher level. > > It is also tempting to try and push this functionality into the database, > since it is already doing all of the required locking anyway. A custom > transaction manager that delegated to the repository session transaction > manager (thereby associating JDBC connections with the repository session), > in conjunction with a stock data source implementation (see below) might do > the trick. Of course this would only work with database PM¹s, but perhaps > other TM¹s could still have the existing SISM locking enabled. This would > be good enough for us since we only use database PM¹s, and a better, more > universal solution could be implemented at a later date. > > Has anyone looked into this issue at all or have any advice / thoughts? > > Issue #2: Multiple issues in database persistence managers. I believe the > database persistence managers have multiple issues (please correct me if I > get any of this wrong). bryan, thanks for sharing your elaborate analysis. however i don't agree with your analysis and proposals regarding the database persistence manager. i tried to explain my point in my previous replies. it seems like you either didn't read it, don't agree or don't care. that's all perfectly fine with me, but i am not gonna repeat myself. since you seem to know exactly what's wrong in the current implementation, feel free to create a jira issue and submit a pach with the proposed changes. btw: i agree that synchronization is bad if you don't understand it and use it incorrectly ;) cheers stefan > > 1. JDBC connection details should not be in the repository.xml. I should be > free to change the specifics of a particular database connection without it > constituting a repository initialization parameter change (which is > effectively what changing the repository.xml is, since it gets copied and > subsequently accessed from inside the repository itself). If a host name or > driver class or even connection URL changes, I should not have to manually > edit internal repository configuration files to effect the change. > 2. Sharing JDBC connections (and objects obtained from them, like prepared > statements) between multiple threads is not a good practice. Even though > many drivers support such activity, it is not specifically required by the > spec, and many drivers do not support it. Even for ones that do, there are > always a significant list of caveats (like changing the transaction > isolation of a connection impacting all threads, or rollbacks sometimes > being executed against the wrong thread). Plus, as far as I can tell, there > is also no particular good reason to attempt this in this case. Connection > pooling is extremely well understood and there are a variety of > implementations to choose from (including Apache¹s own in Jakarta Commons). > 3. Synchronization is bad (generally speaking of course :). Once the > multithreaded issues of JDBC are removed (via a connection pool), there are > no good reasons that I can see to have any synchronization in the database > persistence managers. Since any sort of requirement for synchronized > operation would be coming from a higher layer, it should also be provided at > a higher layer. I have always felt that a good rule of thumb in server code > is to avoid synchronization at all costs, particular in core server > functionality (like reading and writing to a repository). It if extremely > difficult to fully understand the global implications of synchronized code, > particular code that is synchronized at a low level. Any serialization of > user requests is extremely serious in a multithreaded server and, in my > experience, will lead to show-stopping performance and scalability issues in > nearly all cases. In addition, serialization of requests at such a low > level probably means that other synchronized code that is intended to be > properly multithreaded is probably not well tested since the request > serialization has eliminated (or greatly reduced) the possibility of the > code being reentered like it normally would be. > > The solution to all of these issues, maybe, is to use the standard JDBC > DataSource interface to encapsulate the details of managing the JDBC > connections. If all of the current PM and FS implementations that use JDBC > were refactored to have a DataSource member and to get and release > connections inside of each method, then parity with the existing > implementations could be achieved by providing a default DataSourceLookup > strategy implemention that simply encapsulated the existing connection > creation code (ignoring connection release requests). This would allow us > (and others) to externally extend the implementation with alternative > DataSourceLookup strategy implementations, say for accessing a datasource > from JNDI, or getting it from a Spring application context. This solution > also neatly externalizes all of the details of the actual datasource > configuration from the repository.xml. > > Thanks! > Bryan. > > > > On 3/8/07 2:48 AM, "Stefan Guggisberg" <[EMAIL PROTECTED]> wrote: > > > On 3/7/07, Bryan Davis <[EMAIL PROTECTED]> wrote: > >> Well, serializing on the prepared statement is still fairly serialized since > >> we are really only talking about nodes and properties (two locks instead of > >> one). If concurrency is controlled at a higher level then why is > >> synchronization in the PM necessary? > > > > a PM's implementation should be thread-safe because it might be used > > in another context or e.g. by a tool. > > > >> > >> The code now seems to assume that the connection object is thread-safe (and > >> the specifics for thread-safeness of of connection objects and other object > >> derived from them is pretty much up to the driver). This is one of the > >> reasons why connection pooling is used pretty much universally. > >> > >> If the built-in PM's used data sources instead of connections then the > >> connection settings could be more easily externalized (as these are > >> typically configurable by the end user). Is there anyway to external the > >> JDBC connection settings from repository.xml right now (in 1.2.2) and > >> configure them at runtime? > > > > i strongly disagree. the pm configuration is *not* supposed to be > > configurable by the end user and certainly not at runtime. do you think > > that e.g. the tablespace settings (physical datafile paths etc) of an oracle > > db should be user configurable? i hope not... > > > >> > >> You didn't really answer my question about Jackrabbit and its ability to > >> fetch and store information through the PM concurrently... What is the > >> synchronization at the higher level and how does it work? > > > > the current synchronization is used to guarantee data consistency (such as > > referential integrity). > > > > have a look at o.a.j.core.state.SharedItemStateManager#Update.begin() > > and you'll get the idea. > > > >> > >> Finally, we are seeing a new issue where if a particular user uploads a > >> large document all other users start to get exceptions (doing a normal mix > >> of mostly reads/some writes). If there is no way to do concurrent writes to > >> the PM I don't see any way around this problem (and it is pretty serious for > >> us). > > > > there's a related improvement issue: > > https://issues.apache.org/jira/browse/JCR-314 > > > > please feel free to comment on this issue or file a new issue if you think > > that it doesn't cover your use case. > > > > cheers > > stefan > > > >> > >> Bryan. > >> > >> > >> On 3/6/07 4:12 AM, "Stefan Guggisberg" <[EMAIL PROTECTED]> wrote: > >> > >>> On 3/5/07, Bryan Davis <[EMAIL PROTECTED]> wrote: > >>>> > >>>> > >>>> > >>>> On 3/3/07 7:11 AM, "Stefan Guggisberg" <[EMAIL PROTECTED]> wrote: > >>>> > >>>>> hi bryan > >>>>> > >>>>> On 3/2/07, Bryan Davis <[EMAIL PROTECTED]> wrote: > >>>>>> What persistence manager are you using? > >>>>>> > >>>>>> Our tests indicate that the stock persistence managers are a significant > >>>>>> bottleneck for both writes and also initial reads to load the transient > >>>>>> store (on the order of .5 seconds per node when using a remote database > >>>>>> like > >>>>>> MSSQL or Oracle). > >>>>> > >>>>> what do you mean by "load the transient store"? > >>>>> > >>>>>> > >>>>>> The stock db persistence managers have all methods marked as > >>>>>> "synchronized", > >>>>>> which blocks on the classdef (which means that even different persistence > >>>>>> managers for different workspaces will serialize all load, exists and > >>>>>> store > >>>>> > >>>>> assuming you're talking about DatabasePersistenceManager: > >>>>> the store/destroy methods are 'synchronized' on the instance, not on > >>>>> the 'classdef'. > >>>>> see e.g. > >>>>> > >>>>">>>>>"> http://java.sun.com/docs/books/tutorial/essential/concurrency/syncme > th.htm>>>>> > < http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm > > < http://java.sun.com/docs/books/tutorial/essential/concurrency/syncmeth.htm > l > >>>>> > >>>>> the load/exists methods are synchronized on the specific prepared stmt > >>>>> they're > >>>>> using. > >>>>> > >>>>> since every workspace uses its own persistence manager instance i can't > >>>>> follow your conclusion that all load, exists and store operations would be > >>>>> be globally serialized across all workspaces. > >>>> > >>>> Hm, this is my bad... It does seem that sync methods are on the instance. > >>>> Since the db persistence manager has "synchronized" on load, store and > >>>> exists, though, this would still serialize all of these operations for a > >>>> particular workspace. > >>> > >>> ?? the load methods are *not* synchronized. they contain a section which > >>> is synchronized on the particular prepared stmt. > >>> > >>> <quote from my previous reply> > >>> wrt synchronization: > >>> concurrency is controlled outside the persistence manager on a higher level. > >>> eliminating the method synchronization would imo therefore have *no* impact > >>> on concurrency/performance. > >>> </quote> > >>> > >>> cheers > >>> stefan > >>> > >>>> > >>>>>> operations). Presumably this is because they allocate a JDBC connection > >>>>>> at > >>>>>> startup and use it throughout, and the connection object is not > >>>>>> multithreaded. > >>>>> > >>>>> what leads you to this assumption? > >>>> > >>>> Are there other requirements that all of these operations are serialized > >>>> for > >>>> a particular PM instance? This seems like a pretty serious bottleneck > >>>> (and, > >>>> in fact, is a pretty serious bottleneck when the database is remote from > >>>> the > >>>> repository). > >>>> > >>>>>> > >>>>>> This problem isn't as noticeable when you are using embedded Derby and > >>>>>> reading/writing to the file system, but when you are doing a network > >>>>>> operation to a database server, the network latency in combination with > >>>>>> the > >>>>>> serialization of all database operations results in a significant > >>>>>> performance degradation. > >>>>> > >>>>> again: serialization of 'all' database operations? > >>>> > >>>> The distinction between all and all for a workspace is would really only be > >>>> relevant during versioning, right? > >>>> > >>>>>> > >>>>>> The new bundle persistence manager (which isn't yet in SVN) improves > >>>>>> things > >>>>>> dramatically since it inlines properties into the node, so loading or > >>>>>> persisting a node is only one operation (plus the additional connection > >>>>>> for > >>>>>> the LOB) instead of one for the node and and one for each property. The > >>>>>> bundle persistence manager also uses prepared statements and keeps a > >>>>>> PM-level cache of nodes (with properties) and also non-existent nodes > >>>>>> (which > >>>>>> permits many exists() calls to return without accessing the database). > >>>>>> > >>>>>> Changing all db persistence managers to use a datasource and get and > >>>>>> release > >>>>>> connections inside of load, exists and store operations and eliminating > >>>>>> the > >>>>>> method synchronization is a relatively simple change that further > >>>>>> improves > >>>>>> performance for connecting to database servers. > >>>>> > >>>>> the use of datasources, connection pools and the like have been discussed > >>>>> in extenso on the list. see e.g. > >>>>> > >> > ">"> http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.htm > > < http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.htm > > < http://www.mail-archive.com/jackrabbit-dev@incubator.apache.org/msg05181.htm > > >> >> > >> l > >>>>> http://issues.apache.org/jira/browse/JCR-313 > >>>>> > >>>>> i don't see how getting & releasing connections in every load, exists and > >>>>> store > >>>>> call would improve preformance. could you please elaborate? > >>>>> > >>>>> please note that you wouldn't be able to use prepared statements over > >>>>> multiple > >>>>> load, store etc operations because you'd have to return the connection > >>>>> at the end > >>>>> of every call. the performance might therefore be even worse. > >>>>> > >>>>> further note that write operations must occur within a single jdbc > >>>>> transaction, i.e. > >>>>> you can't get a new connection for every store/destroy operation. > >>>>> > >>>>> wrt synchronization: > >>>>> concurrency is controlled outside the persistence manager on a higher > >>>>> level. > >>>>> eliminating the method synchronization would imo therefore have *no* > >>>>> impact > >>>>> on concurrency/performance. > >>>> > >>>> So you are saying that it is impossible to concurrently load or store data > >>>> in Jackrabbit? > >>>> > >>>>>> There is a persistence manager with an ASL license called > >>>>>> "DataSourcePersistenceManager" which seems to the PM of choice for people > >>>>>> using Magnolia (which is backed by Jackrabbit). It also uses prepared > >>>>>> statements and eliminates the current single-connection issues associated > >>>>>> with all of the stock db PMs. It doesn't seem to have been submitted > >>>>>> back > >>>>>> to the Jackrabbit project. If you Google for > >>>>>> " com.iorgagroup.jackrabbit.core.state.db.DataSourcePersistenceManager" > >>>>>> you > >>>>>> should be able to find it. > >>>>> > >>>>> thanks for the hint. i am aware of this pm and i had a look at it a couple > >>>>> of > >>>>> months ago. the major issue was that it didn't implement the > >>>>> correct/required > >>>>> semantics. it used a new connection for every write operation which > >>>>> clearly violates the contract that the write operations should occur > >>>>> within > >>>>> a jdbc transaction bracket. further it creates a prepared stmt on every > >>>>> load, store etc. which is hardly efficient... > >>>> > >>>> Yes, this PM does have this issue. The bundle PM implements prepared > >>>> statements in the correct way. > >>>> > >>>>>> Finally, if you always use the Oracle 10g JDBC drivers, you do not need > >>>>>> to > >>>>>> use the Oracle-specific PMs because the 10g drivers support the standard > >>>>>> BLOB API (in addition to the Oracle-specific BLOB API required by the > >>>>>> older > >>>>>> 9i drivers). This is true even if you are connecting to an older > >>>>>> database > >>>>>> server as the limitation was in the driver itself. Frankly you should > >>>>>> never > >>>>>> use the 9i drivers as they are pretty buggy and the 10g drivers represent > >>>>>> a > >>>>>> complete rewrite. Make sure you use the new driver package because the > >>>>>> 10g > >>>>>> driver JAR also includes the older 9i drivers for backward-compatibility. > >>>>>> The new driver is in a new package (can't remember the exact name off the > >>>>>> top of my head). > >>>>> > >>>>> thanks for the information. > >>>>> > >>>>> cheers > >>>>> stefan > >>>> > >>>> We are very interested in getting a good understanding of the specifics of > >>>> how PM's work, as initial reads and writes, according to our profiling, are > >>>> spending 80-90% of the time inside the PM. > >>>> > >>>> Bryan. > >>>> > >>>> _______________________________________________________________________ > >>>> Notice: This email message, together with any attachments, may contain > >>>> information of BEA Systems, Inc., its subsidiaries and affiliated > >>>> entities, that may be confidential, proprietary, copyrighted and/or > >>>> legally privileged, and is intended solely for the use of the individual > >>>> or entity named in this message. If you are not the intended recipient, > >>>> and have received this message in error, please immediately return this > >>>> by email and then delete it. > >>>> > >> > >> _______________________________________________________________________ > >> Notice: This email message, together with any attachments, may contain > >> information of BEA Systems, Inc., its subsidiaries and affiliated > >> entities, that may be confidential, proprietary, copyrighted and/or > >> legally privileged, and is intended solely for the use of the individual > >> or entity named in this message. If you are not the intended recipient, > >> and have received this message in error, please immediately return this > >> by email and then delete it. > >> > > _______________________________________________________________________ > Notice: This email message, together with any attachments, may contain > information of BEA Systems, Inc., its subsidiaries and affiliated > entities, that may be confidential, proprietary, copyrighted and/or > legally privileged, and is intended solely for the use of the individual > or entity named in this message. If you are not the intended recipient, > and have received this message in error, please immediately return this > by email and then delete it. >