Andy, thanks for the fix.

With regard to the OME, that is a very strange one. I first did some tests using
the IBM JRE 6 and the Eclipse Memory Analyzer (see
http://wiki.eclipse.org/index.php/MemoryAnalyzer#System_Dumps_and_Heap_Dumps_from_IBM_Virtual_Machines
if you're interested), and I noticed that my 2Gb heap dump only showed about
110Mb of real data. My initial max heap was set to 1.5Gb and the stack to 512Mb.
I am not quite sure how that is possible, unless the problem sits 1) in native
memory, or 2) there is a sudden spike in memory allocation which does not give
the GC time to collect before the JVM bails.
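To try to catch the spike in the act, I am thinking of dropping a small
throwaway monitor into the test harness, something like the sketch below. It
only uses the standard java.lang.management API, nothing TDB-specific, and the
class name, interval and output format are just illustrative:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatcher implements Runnable {

        @Override
        public void run() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (!Thread.currentThread().isInterrupted()) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                long usedMb = heap.getUsed() / (1024 * 1024);
                long maxMb = heap.getMax() / (1024 * 1024);
                // A sudden jump here should line up with whichever test is
                // running when the OME finally hits.
                System.err.println("heap used=" + usedMb + "Mb max=" + maxMb + "Mb");
                try {
                    Thread.sleep(250);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        public static void start() {
            Thread t = new Thread(new HeapWatcher(), "heap-watcher");
            t.setDaemon(true);   // don't keep the JVM alive once the tests finish
            t.start();
        }
    }

Calling HeapWatcher.start() once before the suite runs should at least tell me
which test is active when the heap takes off.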
My next test was to use Oracle's JRE 6, which did not start at first because
maxPermGen wasn't large enough, so I increased it to 256Mb (the default is
64Mb). Then, around the same point, I also ran into an OME, but I don't seem to
get a useful heap dump to analyze. I suspect it would show something similar to
what I saw with the IBM JRE. I also tried just increasing the heap size from
1.5Gb to 3Gb (I have 8Gb in total on my machine), and although that delays the
problem, it does not actually avoid it; in fact, the moment the memory spikes
(it happens really fast), my entire machine freezes as all my memory is more or
less gone.

A few things about the unit tests I am using. We have quite a few tests which
"disconnect" and "connect" to the Jena dataset by opening and closing them
(even moving the indices physically around, etc.). I suspect the problem is
related to this, but unfortunately I cannot pin it to one particular test,
since I can "move" the problem along by skipping tests.

I should add that this entire OME problem is a regression from our point of
view, since we did not observe anything like this in 0.8.7. I also have to
emphasize that this is still without using the transaction API. I may give it a
shot with the new API, but I doubt it will resolve the issue.

This brings me to a question: is calling reset on the StoreProvider supposed to
release all resources and allow me to remove the indices from disk?

thanks,

Simon
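PS: regarding the BlockMgrCache.getWrite code quoted further down, this is
roughly the shape of the cache-miss fix I had in mind. It is only a sketch:
"blockMgr" stands for whatever field holds the wrapped BlockMgr inside
BlockMgrCache, and the change Andy actually committed may well look different.

    // Did not find the block in either cache.
    cacheMisses++ ;
    log("Miss/w: %d", id) ;
    // Fetch the block from the wrapped BlockMgr instead of caching null;
    // caching {id = null} is what makes PageBlockMgr.getWrite throw the NPE.
    blk = blockMgr.getWrite(_id) ;
    if ( writeCache != null )
        writeCache.put(id, blk) ;
    return blk ;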
From: Andy Seaborne <[email protected]>
To: [email protected]
Date: 08/08/2011 11:20 AM
Subject: Re: testing TDB-Tx

On 05/08/11 19:01, Simon Helsen wrote:
> I tested my stuff in mapped mode which did not show the problem, so the
> issue I encountered is specific to direct mode. IMO the code below
> contains the problem and needs to be fixed with a call to
> blockMgrCache.getWrite (on the wrapped BlockMgr) whenever there is a
> cache miss.
>
> @Andy: could you fix this for the next build?

Done in SVN.

We've lost the repo (AWS ate the server) [temporarily] so for a bit, it's
"svn update; mvn clean package"

> I still hit the OME though. I will try to analyze the stack dumps to see
> if there is anything special. When I hit the OME, it comes very quick,
> i.e. in a matter of seconds, my entire heap space is exhausted from
> regular heap usage before.

So if I read the thread right, calling BulkUpdateHandlerTDB.removeAll
still goes bad sometimes.

        Andy

>
> Simon
>
> From: Simon Helsen/Toronto/IBM
> To: [email protected]
> Cc: [email protected]
> Date: 08/05/2011 01:27 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Ok, so I looked at the code in BlockMgrCache and I notice that getWrite
> is implemented like this:
>
>     @Override
>     synchronized
>     public Block getWrite(long _id)
>     {
>         Long id = Long.valueOf(_id) ;
>         Block blk = null ;
>         if ( writeCache != null )
>             blk = writeCache.get(id) ;
>         if ( blk != null )
>         {
>             cacheWriteHits++ ;
>             log("Hit(w->w) : %d", id) ;
>             return blk ;
>         }
>
>         // blk is null.
>         // A requested block may be in the other cache. Promote it.
>
>         if ( readCache.containsKey(id) )
>         {
>             blk = readCache.get(id) ;
>             cacheReadHits++ ;
>             log("Hit(w->r) : %d", id) ;
>             blk = promote(blk) ;
>             return blk ;
>         }
>
>         // Did not find.
>         cacheMisses++ ;
>         log("Miss/w: %d", id) ;
>         if ( writeCache != null )
>             writeCache.put(id, blk) ;
>         return blk ;
>     }
>
> Now, in my particular case, the id to come in is 0, but neither cache
> contains the value. In this case, it will put the entry {0 = null} in
> the write cache, which necessarily leads to the NPE in the caller. So I
> am not quite following the logic here. I would expect that if there is a
> cache miss, the wrapped block mgr would be used to obtain the block
> before it is written in the writeCache.
>
> Simon
>
> From: Simon Helsen/Toronto/IBM@IBMCA
> To: [email protected]
> Cc: [email protected]
> Date: 08/05/2011 12:01 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Paolo,
>
> I don't know who wrote the code, but it would help if a first analysis is
> done with the stack trace I provided, and perhaps other questions that can
> help identify the problem and a possible fix. Producing sharable code
> which reproduces the problem is not trivial and may not even be possible,
> since we run in a rather complex framework. If possible, I will try to
> debug myself from within our framework, but obviously I have limited
> knowledge of the details of the PageBlockMgr.
>
> All the instances of this stack trace (and I am seeing quite a few of
> them) seem to come from BulkUpdateHandlerTDB.removeAll, but I know that
> removeAll initially works fine (until the NPE occurs the first time - it
> seems that after the first time, it keeps happening). I will also try to
> isolate the problem more to see if there is anything specific that brings
> the store into this situation.
>
> thanks
>
> Simon
>
> From: Paolo Castagna <[email protected]>
> To: [email protected]
> Date: 08/05/2011 10:46 AM
> Subject: Re: testing TDB-Tx
>
> Hi Simon,
> I don't have an answer or a solution to your problem, but I want to thank
> you for reporting your experience (and the problems you found) on
> jena-dev.
>
> It would be extremely helpful if you could reproduce the problem with some
> sharable code we can run and debug. I know, I know... it's not always easy
> nor possible.
>
> I hit a problem using TestTransSystem.java, which runs multiple threads and
> it's not easy to replicate.
>
> Thanks again and keep sharing on jena-dev, this way everybody can benefit.
>
> Cheers,
> Paolo
>
> Simon Helsen wrote:
> > Hi everyone,
> >
> > I am giving a first stab at integrating TDB-Tx into our framework. My
> > first goal is to test this new TDB *without* actually using the
> > transaction API, because we are coming from TDB 0.8.7.
> > After some minor problems on our end, I seem to run into the following
> > NPE (usually followed by a warning)
> >
> > 09:49:02,176 [jazz.jfs.suspending.indexer.internal.triple] ERROR
> >   com.ibm.team.jfs - CRJZS5663E Unable to persist tripe index
> > java.lang.NullPointerException
> >     at com.hp.hpl.jena.tdb.base.page.PageBlockMgr.getWrite(PageBlockMgr.java:50)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.getMgrWrite(BPTreeNode.java:162)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:145)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.delete(BPTreeNode.java:227)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.deleteAndReturnOld(BPlusTree.java:324)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.delete(BPlusTree.java:318)
> >     at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performDelete(TupleIndexRecord.java:55)
> >     at com.hp.hpl.jena.tdb.index.TupleIndexBase.delete(TupleIndexBase.java:61)
> >     at com.hp.hpl.jena.tdb.index.TupleTable.delete(TupleTable.java:108)
> >     at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeWorker(BulkUpdateHandlerTDB.java:136)
> >     at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeAll(BulkUpdateHandlerTDB.java:90)
> >     at com.hp.hpl.jena.rdf.model.impl.ModelCom.removeAll(ModelCom.java:315)
> >     ...
> > 09:49:02,207 [jazz.jfs.suspending.indexer.internal.triple] WARN
> >   com.hp.hpl.jena.tdb.base.block.BlockMgrCache - Write cache: 0
> > expelling entry that isn't there
> >
> > The exception sits all over my log and I wonder if it is related to the
> > removeAll. Also, after a while, my memory spikes and I run into an OME. I
> > don't know yet if there is a relation, but possibly these exceptions cause
> > serious leaks.
> >
> > The version of TDB (and associated libs) I am using is
> > tx-tdb-0.9.0-20110802.083904-6
> >
> > thanks,
> >
> > Simon
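(For context on the "disconnect"/"connect" tests mentioned above, this is the
rough shape of the pattern they exercise, heavily simplified. The location,
data and loop count are made up, and the real tests run inside our framework,
so this is not a reproduction, just the pattern:)

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class ConnectDisconnectSketch {
        public static void main(String[] args) {
            String location = "target/tdb-test";
            for (int i = 0; i < 100; i++) {
                // "connect": open the dataset at a fixed on-disk location
                Dataset dataset = TDBFactory.createDataset(location);
                Model model = dataset.getDefaultModel();
                model.createResource("http://example/s" + i)
                     .addProperty(model.createProperty("http://example/", "p"), "o" + i);
                // bulk delete; this is the path that ends up in
                // BulkUpdateHandlerTDB.removeAll in the stack trace above
                model.removeAll();
                // "disconnect": close the dataset again (the real tests also
                // move or delete the index files on disk at this point)
                dataset.close();
            }
        }
    }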
