----- Original Message -----
> From: Frank Budinsky <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, February 17, 2011 8:28 AM
> Subject: Re: OutOfMemoryError while loading datagraphs
>
>
> Hi Andy,
>
> Andy Seaborne <[email protected]> wrote on 02/17/2011 08:35:39 AM:
>
> > On 16/02/11 18:47, Frank Budinsky wrote:
> > >
> > > Hi,
> >
> > Hi Frank,
> >
> > > I am trying to load about 100,000 datagraphs (roughly 10M triples) into
> > > a Jena TDB Dataset, but am running out of memory.
> >
> > 10M total I hope :-)
>
> Yes, that's the total for this experiment. Would you say that is getting to
> the upper limit of what's possible?
>
> >
> > Is this on a 32 bit machine or a 64 bit machine? Also, which JVM is it?
>
> 32 bit machine and standard 1.6 JVM.
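
(For reference: on a 32-bit JVM, TDB runs in "direct" mode, caching database
blocks on the Java heap rather than in memory-mapped files, so heap headroom
matters much more than on 64-bit. Raising the maximum heap is the cheapest
first experiment; the class name below is just a placeholder.)

    java -Xmx1200m -cp .:lib/* MyTdbLoader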
>
> >
> > > I'm doing the load by
> > > repeatedly calling code that looks something like this:
> > >
> > >      InputStream instream = entity.getContent(); // the RDF graph to load
> >
> > An input stream of RDF/XML bytes.
>
> Yes.
>
> >
> > What does the data look like?
>
> Pretty standard RDF/XML, ranging in size from 50 - 400 lines of XML. Here's
> one example:
>
> <rdf:RDF
>     xmlns:rtc_ext="http://jazz.net/xmlns/prod/jazz/rtc/ext/1.0/"
>     xmlns:rtc_cm="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/"
>     xmlns:oslc_cm="http://open-services.net/ns/cm#"
>     xmlns:dcterms="http://purl.org/dc/terms/"
>     xmlns:oslc_cmx="http://open-services.net/ns/cm-x#"
>     xmlns:oslc="http://open-services.net/ns/core#"
>     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
>   <rdf:Description rdf:nodeID="A0">
>     <rdf:subject rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8"/>
>     <rdf:predicate rdf:resource="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.parentworkitem.children"/>
>     <rdf:object rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
>     <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
>     <dcterms:title>9: Frank's second defect - modified</dcterms:title>
>   </rdf:Description>
>   <rdf:Description rdf:about="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8">
>     <rtc_cm:estimate></rtc_cm:estimate>
>     <rtc_cm:progressTracking rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/progressTracking"/>
>     <rdf:type rdf:resource="http://open-services.net/ns/cm#ChangeRequest"/>
>     <oslc:serviceProvider rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/contexts/_9ZDK0OHtEd-OKOGoixAXCg/workitems/services"/>
>     <dcterms:creator rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <rtc_cm:type rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/types/_9ZDK0OHtEd-OKOGoixAXCg/defect"/>
>     <oslc_cmx:project rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/projectareas/_9ZDK0OHtEd-OKOGoixAXCg"/>
>     <rtc_cm:state rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workflows/_9ZDK0OHtEd-OKOGoixAXCg/states/bugzillaWorkflow/2"/>
>     <rtc_cm:timeSheet rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:timeSheet"/>
>     <rtc_ext:contextId>_9ZDK0OHtEd-OKOGoixAXCg</rtc_ext:contextId>
>     <oslc:shortTitle rdf:parseType="Literal">Defect 8</oslc:shortTitle>
>     <dcterms:identifier>8</dcterms:identifier>
>     <dcterms:created>2010-10-27T17:39:30.437Z</dcterms:created>
>     <rtc_cm:correctedEstimate></rtc_cm:correctedEstimate>
>     <oslc_cmx:severity rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/severity/severity.literal.l3"/>
>     <oslc_cm:reviewed>false</oslc_cm:reviewed>
>     <rtc_cm:filedAgainst rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemOid/com.ibm.team.workitem.Category/_-cZeNOHtEd-OKOGoixAXCg"/>
>     <oslc_cm:status>In Progress</oslc_cm:status>
>     <oslc_cm:fixed>false</oslc_cm:fixed>
>     <dcterms:subject></dcterms:subject>
>     <dcterms:modified>2010-10-28T17:56:32.324Z</dcterms:modified>
>     <oslc_cm:approved>false</oslc_cm:approved>
>     <rtc_cm:resolvedBy rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_YNh4MOlsEdq4xpiOKg5hvA"/>
>     <rtc_cm:modifiedBy rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <dcterms:contributor rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <oslc:discussion rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:comments"/>
>     <dcterms:type>Defect</dcterms:type>
>     <dcterms:description rdf:parseType="Literal">test defect number 1</dcterms:description>
>     <rtc_ext:archived>false</rtc_ext:archived>
>     <rtc_cm:com.ibm.team.workitem.linktype.parentworkitem.children rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
>     <oslc_cmx:priority rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/priority/priority.literal.l3"/>
>     <oslc_cm:closed>false</oslc_cm:closed>
>     <oslc_cm:verified>false</oslc_cm:verified>
>     <oslc_cm:inprogress>true</oslc_cm:inprogress>
>     <dcterms:title rdf:parseType="Literal">Frank's first defect - modified via OSLC - and via RTC</dcterms:title>
>     <rtc_cm:timeSpent></rtc_cm:timeSpent>
>   </rdf:Description>
> </rdf:RDF>
>
> >
> > >      fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
> > >      try {
> > >          Model model = fResourceDataset.getNamedModel(resourceURI);
> > >          model.read(instream, null);
> > >          //model.close();
> > >      } finally {
> > >          fResourceDataset.getLock().leaveCriticalSection();
> > >      }
> > >      instream.close();
> > >
> > > After calling this code about 2-3 thousand times, it starts to run much
> > > slower, and then eventually I get an exception like this:
> > >
> > >      Exception in thread "pool-3-thread-43" java.lang.OutOfMemoryError: Java heap space
> >
> > Could you provide a complete minimal example please? There are some
> > details like how fResourceDataset is set that might make a difference.
>
> It might be hard to get a simple example.
>
> fResourceDataset is created like this:
>
> TDBFactory.createDataset(dirName);
>
> I remove the directory between runs, so it starts with an empty dataset.
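
(For anyone who wants to reproduce this, a self-contained sketch of the loop
being described might look like the following. The one-triple RDF/XML document
stands in for entity.getContent(), and the graph URIs are placeholders; this
is not Frank's actual code.)

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.shared.Lock;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadTest {
        public static void main(String[] args) {
            // Fresh directory per run, as described above.
            Dataset dataset = TDBFactory.createDataset("DB");
            for (int i = 0; i < 100000; i++) {
                InputStream in = nextDocument(i); // stand-in for entity.getContent()
                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {
                    Model model = dataset.getNamedModel("http://example.org/graph/" + i);
                    model.read(in, null); // parse RDF/XML into the named graph
                } finally {
                    dataset.getLock().leaveCriticalSection();
                }
            }
            TDB.sync(dataset); // flush TDB's caches to disk
        }

        // Generates a tiny RDF/XML document in place of the real HTTP entity.
        static InputStream nextDocument(int i) {
            String rdf =
                "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'" +
                " xmlns:dcterms='http://purl.org/dc/terms/'>" +
                "<rdf:Description rdf:about='http://example.org/item/" + i + "'>" +
                "<dcterms:title>item " + i + "</dcterms:title>" +
                "</rdf:Description></rdf:RDF>";
            return new ByteArrayInputStream(rdf.getBytes());
        }
    }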
>
> I also have this initialization in my program:
>
> static {
>     // Configure Jena TDB so that the default data graph in SPARQL queries
>     // will be the union of all named graphs. Each resource added to the
>     // index will be stored in a separate TDB data graph. The actual default
>     // (hidden) data graph will be used to store configuration information
>     // for the index.
>     TDB.getContext().set(TDB.symUnionDefaultGraph, true);
>     TDB.setOptimizerWarningFlag(false); // TODO: do we need to provide a BGP optimizer?
> }
>
> Could any of this be causing problems?
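
(As an aside, the symUnionDefaultGraph setting only changes what SPARQL
queries see, not how loading behaves, so it seems an unlikely culprit for the
leak. With it set, a query against the default graph matches the union of all
named graphs; a small sketch, using a placeholder database path:)

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class UnionCheck {
        public static void main(String[] args) {
            TDB.getContext().set(TDB.symUnionDefaultGraph, true);
            Dataset dataset = TDBFactory.createDataset("DB");
            // With the union setting, this pattern matches triples from every
            // named graph, not just the (hidden) real default graph.
            QueryExecution qexec = QueryExecutionFactory.create(
                    "SELECT * WHERE { ?s ?p ?o } LIMIT 5", dataset);
            try {
                ResultSetFormatter.out(qexec.execSelect());
            } finally {
                qexec.close();
            }
        }
    }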
>
> >
> > The stack trace might be useful as well, although it doesn't prove
> > exactly where the memory is being used.
>
> It might make more sense for me to try to track this down further myself.
> If you can just confirm that you don't see anything wrong with how I'm
> using Jena, I'll take it from there.
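
(One quick way to narrow this down without a full example: take a histogram
of live objects from the running loader and see which classes dominate the
heap, e.g.

    jmap -histo:live <pid>

Whether the top entries are parser classes, TDB cache structures, or
application classes should point at where to look next.)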
>
> >
> > > I tried increasing the amount of memory, but that just increased the
> > > number of calls that succeed (e.g., 10000 vs 2000) before getting the
> > > exception.
> > >
> > > I'm wondering if there's something I need to do to release memory
> > > between these calls. I tried putting in a call to model.close(), but
> > > all it seemed to do was make it run slower; I still got the exception.
> >
> > There isn't anything that should be needed, but I'm wondering if several
> > things like entity.getContent are involved in using memory and it's the
> > cumulative effect that's a problem.
> >
> > > Is there something else I should be doing, or is there a possible
> > > memory leak in the version of Jena I'm using (a fairly recent
> > > SNAPSHOT build)?
> > >
> > > Btw, I tried commenting out the call to model.read(instream, null) to
> > > confirm that the memory leak isn't somewhere else in my program, and
> > > that worked - i.e., went through the 100,000 calls without an exception.
> > >
> > > Any ideas or pointers to what may be wrong would be appreciated.
> >
> > Another way to do this is to use the bulk loader from the command line.
> > It can read from stdin or from a collection of files.
> >
> > RDF/XML parsing is expensive - N-Triples is fastest.
>
> Is the difference really large? Are there any published performance numbers
> that show what load speeds can be expected from Jena?
Yes. See
http://www.semanticoverflow.com/questions/506/pros-and-cons-for-different-jena-backends-sdb-vs-tdb/2983#2983
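
(For completeness, the bulk-load route Andy mentions would look something
like this; the paths are examples. Note that, as far as I know, tdbloader in
this form loads triples into the default graph, so it would need adapting for
the graph-per-resource design described above:)

    tdbloader --loc=DB data.nt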
>
> >
> > Andy
>
> Thanks a lot for your help!
>
> Frank.