Andy, thanks for the fix.

With regard to the OME, that is a very strange one. I first did some tests using
the IBM JRE 6 and the Eclipse Memory Analyzer (see
http://wiki.eclipse.org/index.php/MemoryAnalyzer#System_Dumps_and_Heap_Dumps_from_IBM_Virtual_Machines
if you're interested), and I noticed that my 2Gb heap dump only showed about
110Mb of real data. My initial max heap was set to 1.5Gb and the stack to 512Mb.
I am not quite sure how that is possible, unless the problem sits 1) in native
memory, or 2) there is a sudden spike in memory allocation which does not give
the GC time to collect before the JVM bails.
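To try to catch the spike in the act, I am thinking of dropping a small
throwaway monitor into the test harness, something like the sketch below. It
only uses the standard java.lang.management API, nothing TDB-specific, and the
class name, interval and output format are just illustrative:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapWatcher implements Runnable {

        @Override
        public void run() {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (!Thread.currentThread().isInterrupted()) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                long usedMb = heap.getUsed() / (1024 * 1024);
                long maxMb = heap.getMax() / (1024 * 1024);
                // A sudden jump here should line up with whichever test is
                // running when the OME finally hits.
                System.err.println("heap used=" + usedMb + "Mb max=" + maxMb + "Mb");
                try {
                    Thread.sleep(250);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        public static void start() {
            Thread t = new Thread(new HeapWatcher(), "heap-watcher");
            t.setDaemon(true);   // don't keep the JVM alive once the tests finish
            t.start();
        }
    }

Calling HeapWatcher.start() once before the suite runs should at least tell me
which test is active when the heap takes off.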
My next test was to use Oracle's JRE 6, which did not start at first because
maxPermGen wasn't large enough, so I increased it to 256Mb (the default is
64Mb). Then, around the same point, I also ran into an OME, but I don't seem to
get a useful heap dump to analyze. I suspect it would show something similar to
what I saw with the IBM JRE. I also tried just increasing the heap size from
1.5Gb to 3Gb (I have 8Gb in total on my machine), and although that delays the
problem, it does not actually avoid it; in fact, the moment the memory spikes
(it happens really fast), my entire machine freezes as all my memory is more or
less gone.

A few things about the unit tests I am using. We have quite a few tests which
"disconnect" and "connect" to the Jena dataset by opening and closing them
(even moving the indices physically around, etc.). I suspect the problem is
related to this, but unfortunately I cannot pin it to one particular test,
since I can "move" the problem along by skipping tests.

I should add that this entire OME problem is a regression from our point of
view, since we did not observe anything like this in 0.8.7. I also have to
emphasize that this is still without using the transaction API. I may give it a
shot with the new API, but I doubt it will resolve the issue.

This brings me to a question: is calling reset on the StoreProvider supposed to
release all resources and allow me to remove the indices from disk?

thanks,

Simon
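PS: regarding the BlockMgrCache.getWrite code quoted further down, this is
roughly the shape of the cache-miss fix I had in mind. It is only a sketch:
"blockMgr" stands for whatever field holds the wrapped BlockMgr inside
BlockMgrCache, and the change Andy actually committed may well look different.

    // Did not find the block in either cache.
    cacheMisses++ ;
    log("Miss/w: %d", id) ;
    // Fetch the block from the wrapped BlockMgr instead of caching null;
    // caching {id = null} is what makes PageBlockMgr.getWrite throw the NPE.
    blk = blockMgr.getWrite(_id) ;
    if ( writeCache != null )
        writeCache.put(id, blk) ;
    return blk ;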
From: Andy Seaborne <[email protected]>
To: [email protected]
Date: 08/08/2011 11:20 AM
Subject: Re: testing TDB-Tx

On 05/08/11 19:01, Simon Helsen wrote:
> I tested my stuff in mapped mode which did not show the problem, so the
> issue I encountered is specific to direct mode. IMO the code below
> contains the problem and needs to be fixed with a call to
> blockMgrCache.getWrite (on the wrapped BlockMgr) whenever there is a
> cache miss.
>
> @Andy: could you fix this for the next build?

Done in SVN.

We've lost the repo (AWS ate the server) [temporarily] so for a bit, it's
"svn update; mvn clean package"

> I still hit the OME though. I will try to analyze the stack dumps to see
> if there is anything special. When I hit the OME, it comes very quick,
> i.e. in a matter of seconds, my entire heap space is exhausted from
> regular heap usage before.

So if I read the thread right, calling BulkUpdateHandlerTDB.removeAll
still goes bad sometimes.

        Andy

>
> Simon
>
> From: Simon Helsen/Toronto/IBM
> To: [email protected]
> Cc: [email protected]
> Date: 08/05/2011 01:27 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Ok, so I looked at the code in BlockMgrCache and I notice that getWrite
> is implemented like this:
>
>     @Override
>     synchronized
>     public Block getWrite(long _id)
>     {
>         Long id = Long.valueOf(_id) ;
>         Block blk = null ;
>         if ( writeCache != null )
>             blk = writeCache.get(id) ;
>         if ( blk != null )
>         {
>             cacheWriteHits++ ;
>             log("Hit(w->w) : %d", id) ;
>             return blk ;
>         }
>
>         // blk is null.
>         // A requested block may be in the other cache. Promote it.
>
>         if ( readCache.containsKey(id) )
>         {
>             blk = readCache.get(id) ;
>             cacheReadHits++ ;
>             log("Hit(w->r) : %d", id) ;
>             blk = promote(blk) ;
>             return blk ;
>         }
>
>         // Did not find.
>         cacheMisses++ ;
>         log("Miss/w: %d", id) ;
>         if ( writeCache != null )
>             writeCache.put(id, blk) ;
>         return blk ;
>     }
>
> Now, in my particular case, the id to come in is 0, but neither cache
> contains the value. In this case, it will put the entry {0 = null} in
> the write cache, which necessarily leads to the NPE in the caller. So I
> am not quite following the logic here. I would expect that if there is a
> cache miss, the wrapped block mgr would be used to obtain the block
> before it is written in the writeCache.
>
> Simon
>
> From: Simon Helsen/Toronto/IBM@IBMCA
> To: [email protected]
> Cc: [email protected]
> Date: 08/05/2011 12:01 PM
> Subject: Re: testing TDB-Tx
>
> ------------------------------------------------------------------------
>
> Paolo,
>
> I don't know who wrote the code, but it would help if a first analysis is
> done with the stack trace I provided, and perhaps other questions that can
> help identify the problem and a possible fix. Producing sharable code
> which reproduces the problem is not trivial and may not even be possible,
> since we run in a rather complex framework. If possible, I will try to
> debug myself from within our framework, but obviously I have limited
> knowledge of the details of the PageBlockMgr.
>
> All the instances of this stack trace (and I am seeing quite a few of
> them) seem to come from BulkUpdateHandlerTDB.removeAll, but I know that
> removeAll initially works fine (until the NPE occurs the first time - it
> seems that after the first time, it keeps happening). I will also try to
> isolate the problem more to see if there is anything specific that brings
> the store into this situation.
>
> thanks
>
> Simon
>
> From: Paolo Castagna <[email protected]>
> To: [email protected]
> Date: 08/05/2011 10:46 AM
> Subject: Re: testing TDB-Tx
>
> Hi Simon,
> I don't have an answer or a solution to your problem, but I want to thank
> you for reporting your experience (and the problems you found) on
> jena-dev.
>
> It would be extremely helpful if you could reproduce the problem with some
> sharable code we can run and debug. I know, I know... it's not always easy
> nor possible.
>
> I hit a problem using TestTransSystem.java, which runs multiple threads and
> it's not easy to replicate.
>
> Thanks again and keep sharing on jena-dev, this way everybody can benefit.
>
> Cheers,
> Paolo
>
> Simon Helsen wrote:
> > Hi everyone,
> >
> > I am giving a first stab at integrating TDB-Tx into our framework. My
> > first goal is to test this new TDB *without* actually using the
> > transaction API, because we are coming from TDB 0.8.7.
> > After some minor problems on our end, I seem to run into the following
> > NPE (usually followed by a warning)
> >
> > 09:49:02,176 [jazz.jfs.suspending.indexer.internal.triple] ERROR
> >   com.ibm.team.jfs - CRJZS5663E Unable to persist tripe index
> > java.lang.NullPointerException
> >     at com.hp.hpl.jena.tdb.base.page.PageBlockMgr.getWrite(PageBlockMgr.java:50)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.getMgrWrite(BPTreeNode.java:162)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:145)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.delete(BPTreeNode.java:227)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.deleteAndReturnOld(BPlusTree.java:324)
> >     at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.delete(BPlusTree.java:318)
> >     at com.hp.hpl.jena.tdb.index.TupleIndexRecord.performDelete(TupleIndexRecord.java:55)
> >     at com.hp.hpl.jena.tdb.index.TupleIndexBase.delete(TupleIndexBase.java:61)
> >     at com.hp.hpl.jena.tdb.index.TupleTable.delete(TupleTable.java:108)
> >     at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeWorker(BulkUpdateHandlerTDB.java:136)
> >     at com.hp.hpl.jena.tdb.graph.BulkUpdateHandlerTDB.removeAll(BulkUpdateHandlerTDB.java:90)
> >     at com.hp.hpl.jena.rdf.model.impl.ModelCom.removeAll(ModelCom.java:315)
> >     ...
> > 09:49:02,207 [jazz.jfs.suspending.indexer.internal.triple] WARN
> >   com.hp.hpl.jena.tdb.base.block.BlockMgrCache - Write cache: 0
> > expelling entry that isn't there
> >
> > The exception sits all over my log and I wonder if it is related to the
> > removeAll. Also, after a while, my memory spikes and I run into an OME. I
> > don't know yet if there is a relation, but possibly these exceptions cause
> > serious leaks.
> >
> > The version of TDB (and associated libs) I am using is
> > tx-tdb-0.9.0-20110802.083904-6
> >
> > thanks,
> >
> > Simon
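(For context on the "disconnect"/"connect" tests mentioned above, this is the
rough shape of the pattern they exercise, heavily simplified. The location,
data and loop count are made up, and the real tests run inside our framework,
so this is not a reproduction, just the pattern:)

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class ConnectDisconnectSketch {
        public static void main(String[] args) {
            String location = "target/tdb-test";
            for (int i = 0; i < 100; i++) {
                // "connect": open the dataset at a fixed on-disk location
                Dataset dataset = TDBFactory.createDataset(location);
                Model model = dataset.getDefaultModel();
                model.createResource("http://example/s" + i)
                     .addProperty(model.createProperty("http://example/", "p"), "o" + i);
                // bulk delete; this is the path that ends up in
                // BulkUpdateHandlerTDB.removeAll in the stack trace above
                model.removeAll();
                // "disconnect": close the dataset again (the real tests also
                // move or delete the index files on disk at this point)
                dataset.close();
            }
        }
    }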
