----- Original Message -----
> From: Frank Budinsky <[email protected]>
> To: [email protected]
> Cc:
> Sent: Thursday, February 17, 2011 8:28 AM
> Subject: Re: OutOfMemoryError while loading datagraphs
>
>
> Hi Andy,
>
> Andy Seaborne <[email protected]> wrote on 02/17/2011 08:35:39 AM:
>
> > On 16/02/11 18:47, Frank Budinsky wrote:
> > >
> > > Hi,
> >
> > Hi Frank,
> >
> > > I am trying to load about 100,000 datagraphs (roughly 10M triples) into
> > > a Jena TDB Dataset, but am running out of memory.
> >
> > 10M total I hope :-)
>
> Yes, that's the total for this experiment. Would you say that is getting to
> the upper limit of what's possible?
>
> >
> > Is this on a 32 bit machine or a 64 bit machine? Also, which JVM is it?
>
> 32 bit machine and standard 1.6 JVM.
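
(For reference: on a 32-bit JVM, TDB runs in "direct" mode, caching database
blocks on the Java heap rather than in memory-mapped files, so heap headroom
matters much more than on 64-bit. Raising the maximum heap is the cheapest
first experiment; the class name below is just a placeholder.)

    java -Xmx1200m -cp .:lib/* MyTdbLoader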
>
> >
> > > I'm doing the load by
> > > repeatedly calling code that looks something like this:
> > >
> > >      InputStream instream = entity.getContent(); // the RDF graph to load
> >
> > An input stream of RDF/XML bytes.
>
> Yes.
>
> >
> > What does the data look like?
>
> Pretty standard RDF/XML, ranging in size from 50 - 400 lines of XML. Here's
> one example:
>
> <rdf:RDF
>     xmlns:rtc_ext="http://jazz.net/xmlns/prod/jazz/rtc/ext/1.0/"
>     xmlns:rtc_cm="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/"
>     xmlns:oslc_cm="http://open-services.net/ns/cm#"
>     xmlns:dcterms="http://purl.org/dc/terms/"
>     xmlns:oslc_cmx="http://open-services.net/ns/cm-x#"
>     xmlns:oslc="http://open-services.net/ns/core#"
>     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
>   <rdf:Description rdf:nodeID="A0">
>     <rdf:subject rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8"/>
>     <rdf:predicate rdf:resource="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.parentworkitem.children"/>
>     <rdf:object rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
>     <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
>     <dcterms:title>9: Frank's second defect - modified</dcterms:title>
>   </rdf:Description>
>   <rdf:Description rdf:about="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8">
>     <rtc_cm:estimate></rtc_cm:estimate>
>     <rtc_cm:progressTracking rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/progressTracking"/>
>     <rdf:type rdf:resource="http://open-services.net/ns/cm#ChangeRequest"/>
>     <oslc:serviceProvider rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/contexts/_9ZDK0OHtEd-OKOGoixAXCg/workitems/services"/>
>     <dcterms:creator rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <rtc_cm:type rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/types/_9ZDK0OHtEd-OKOGoixAXCg/defect"/>
>     <oslc_cmx:project rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/projectareas/_9ZDK0OHtEd-OKOGoixAXCg"/>
>     <rtc_cm:state rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workflows/_9ZDK0OHtEd-OKOGoixAXCg/states/bugzillaWorkflow/2"/>
>     <rtc_cm:timeSheet rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:timeSheet"/>
>     <rtc_ext:contextId>_9ZDK0OHtEd-OKOGoixAXCg</rtc_ext:contextId>
>     <oslc:shortTitle rdf:parseType="Literal">Defect 8</oslc:shortTitle>
>     <dcterms:identifier>8</dcterms:identifier>
>     <dcterms:created>2010-10-27T17:39:30.437Z</dcterms:created>
>     <rtc_cm:correctedEstimate></rtc_cm:correctedEstimate>
>     <oslc_cmx:severity rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/severity/severity.literal.l3"/>
>     <oslc_cm:reviewed>false</oslc_cm:reviewed>
>     <rtc_cm:filedAgainst rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemOid/com.ibm.team.workitem.Category/_-cZeNOHtEd-OKOGoixAXCg"/>
>     <oslc_cm:status>In Progress</oslc_cm:status>
>     <oslc_cm:fixed>false</oslc_cm:fixed>
>     <dcterms:subject></dcterms:subject>
>     <dcterms:modified>2010-10-28T17:56:32.324Z</dcterms:modified>
>     <oslc_cm:approved>false</oslc_cm:approved>
>     <rtc_cm:resolvedBy rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_YNh4MOlsEdq4xpiOKg5hvA"/>
>     <rtc_cm:modifiedBy rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <dcterms:contributor rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
>     <oslc:discussion rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:comments"/>
>     <dcterms:type>Defect</dcterms:type>
>     <dcterms:description rdf:parseType="Literal">test defect number 1</dcterms:description>
>     <rtc_ext:archived>false</rtc_ext:archived>
>     <rtc_cm:com.ibm.team.workitem.linktype.parentworkitem.children rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
>     <oslc_cmx:priority rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/priority/priority.literal.l3"/>
>     <oslc_cm:closed>false</oslc_cm:closed>
>     <oslc_cm:verified>false</oslc_cm:verified>
>     <oslc_cm:inprogress>true</oslc_cm:inprogress>
>     <dcterms:title rdf:parseType="Literal">Frank's first defect - modified via OSLC - and via RTC</dcterms:title>
>     <rtc_cm:timeSpent></rtc_cm:timeSpent>
>   </rdf:Description>
> </rdf:RDF>
>
> >
> > >      fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
> > >      try {
> > >          Model model = fResourceDataset.getNamedModel(resourceURI);
> > >          model.read(instream, null);
> > >          //model.close();
> > >      } finally {
> > >          fResourceDataset.getLock().leaveCriticalSection();
> > >      }
> > >      instream.close();
> > >
> > > After calling this code about 2-3 thousand times, it starts to run much
> > > slower, and then eventually I get an exception like this:
> > >
> > >      Exception in thread "pool-3-thread-43" java.lang.OutOfMemoryError: Java heap space
> >
> > Could you provide a complete minimal example please? There are some
> > details like how fResourceDataset is set that might make a difference.
>
> It might be hard to get a simple example.
>
> fResourceDataset is created like this:
>
> TDBFactory.createDataset(dirName);
>
> I remove the directory between runs, so it starts with an empty dataset.
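
(For anyone who wants to reproduce this, a self-contained sketch of the loop
being described might look like the following. The one-triple RDF/XML document
stands in for entity.getContent(), and the graph URIs are placeholders; this
is not Frank's actual code.)

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.shared.Lock;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LoadTest {
        public static void main(String[] args) {
            // Fresh directory per run, as described above.
            Dataset dataset = TDBFactory.createDataset("DB");
            for (int i = 0; i < 100000; i++) {
                InputStream in = nextDocument(i); // stand-in for entity.getContent()
                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {
                    Model model = dataset.getNamedModel("http://example.org/graph/" + i);
                    model.read(in, null); // parse RDF/XML into the named graph
                } finally {
                    dataset.getLock().leaveCriticalSection();
                }
            }
            TDB.sync(dataset); // flush TDB's caches to disk
        }

        // Generates a tiny RDF/XML document in place of the real HTTP entity.
        static InputStream nextDocument(int i) {
            String rdf =
                "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'" +
                " xmlns:dcterms='http://purl.org/dc/terms/'>" +
                "<rdf:Description rdf:about='http://example.org/item/" + i + "'>" +
                "<dcterms:title>item " + i + "</dcterms:title>" +
                "</rdf:Description></rdf:RDF>";
            return new ByteArrayInputStream(rdf.getBytes());
        }
    }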
>
> I also have this initialization in my program:
>
> static {
>     // Configure Jena TDB so that the default data graph in SPARQL queries
>     // will be the union of all named graphs. Each resource added to the
>     // index will be stored in a separate TDB data graph. The actual default
>     // (hidden) data graph will be used to store configuration information
>     // for the index.
>     TDB.getContext().set(TDB.symUnionDefaultGraph, true);
>     TDB.setOptimizerWarningFlag(false); // TODO: do we need to provide a BGP optimizer?
> }
>
> Could any of this be causing problems?
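
(As an aside, the symUnionDefaultGraph setting only changes what SPARQL
queries see, not how loading behaves, so it seems an unlikely culprit for the
leak. With it set, a query against the default graph matches the union of all
named graphs; a small sketch, using a placeholder database path:)

    import com.hp.hpl.jena.query.*;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class UnionCheck {
        public static void main(String[] args) {
            TDB.getContext().set(TDB.symUnionDefaultGraph, true);
            Dataset dataset = TDBFactory.createDataset("DB");
            // With the union setting, this pattern matches triples from every
            // named graph, not just the (hidden) real default graph.
            QueryExecution qexec = QueryExecutionFactory.create(
                    "SELECT * WHERE { ?s ?p ?o } LIMIT 5", dataset);
            try {
                ResultSetFormatter.out(qexec.execSelect());
            } finally {
                qexec.close();
            }
        }
    }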
>
> >
> > The stack trace might be useful as well, although it doesn't prove
> > exactly where the memory is being used.
>
> It might make more sense for me to try to track this down further myself.
> If you can just confirm that you don't see anything wrong with how I'm
> using Jena, I'll take it from there.
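
(One quick way to narrow this down without a full example: take a histogram
of live objects from the running loader and see which classes dominate the
heap, e.g.

    jmap -histo:live <pid>

Whether the top entries are parser classes, TDB cache structures, or
application classes should point at where to look next.)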
>
> >
> > > I tried increasing the amount of memory, but that just increased the
> > > number of calls that succeed (e.g., 10000 vs 2000) before getting the
> > > exception.
> > >
> > > I'm wondering if there's something I need to do to release memory
> > > between these calls. I tried putting in a call to model.close(), but
> > > all it seemed to do was make it run slower; I still got the exception.
> >
> > There isn't anything that should be needed, but I'm wondering if several
> > things like entity.getContent are involved in using memory and it's the
> > cumulative effect that's a problem.
> >
> > > Is there something else I should be doing, or is there a possible
> > > memory leak in the version of Jena I'm using (a fairly recent
> > > SNAPSHOT build)?
> > >
> > > Btw, I tried commenting out the call to model.read(instream, null) to
> > > confirm that the memory leak isn't somewhere else in my program, and
> > > that worked - i.e., went through the 100,000 calls without an exception.
> > >
> > > Any ideas or pointers to what may be wrong would be appreciated.
> >
> > Another way to do this is to use the bulk loader from the command line.
> > It can read from stdin or from a collection of files.
> >
> > RDF/XML parsing is expensive - N-Triples is fastest.
>
> Is the difference really large? Are there any published performance numbers
> that show what load speeds can be expected from Jena?
Yes. See
http://www.semanticoverflow.com/questions/506/pros-and-cons-for-different-jena-backends-sdb-vs-tdb/2983#2983
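
(For completeness, the bulk-load route Andy mentions would look something
like this; the paths are examples. Note that, as far as I know, tdbloader in
this form loads triples into the default graph, so it would need adapting for
the graph-per-resource design described above:)

    tdbloader --loc=DB data.nt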
>
> >
> > Andy
>
> Thanks a lot for your help!
>
> Frank.