Hi Andy,

Andy Seaborne <[email protected]> wrote on 02/17/2011 08:35:39 AM:

> On 16/02/11 18:47, Frank Budinsky wrote:
> >
> > Hi,
>
> Hi Frank,
>
> > I am trying to load about 100,000 datagraphs (roughly 10M triples)
> > into a Jena TDB Dataset, but am running out of memory.
>
> 10M total I hope :-)

Yes, that's the total for this experiment. Would you say that is getting to
the upper limit of what's possible?

>
> Is this on a 32 bit machine or a 64 bit machine? Also, which JVM is it?

A 32-bit machine and a standard 1.6 JVM.

>
> > I'm doing the load by
> > repeatedly calling code that looks something like this:
> >
> >        InputStream instream = entity.getContent(); // the RDF graph to load
>
> An input stream of RDF/XML bytes.

Yes.

>
> What does the data look like?

Pretty standard RDF/XML, ranging in size from 50 to 400 lines of XML. Here's
one example:

<rdf:RDF
    xmlns:rtc_ext="http://jazz.net/xmlns/prod/jazz/rtc/ext/1.0/"
    xmlns:rtc_cm="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/"
    xmlns:oslc_cm="http://open-services.net/ns/cm#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:oslc_cmx="http://open-services.net/ns/cm-x#"
    xmlns:oslc="http://open-services.net/ns/core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
  <rdf:Description rdf:nodeID="A0">
    <rdf:subject
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8"/>
    <rdf:predicate
rdf:resource="http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/com.ibm.team.workitem.linktype.parentworkitem.children"/>
    <rdf:object
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
    <rdf:type
rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement"/>
    <dcterms:title>9: Frank's second defect - modified</dcterms:title>
  </rdf:Description>
  <rdf:Description
rdf:about="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/8">
    <rtc_cm:estimate></rtc_cm:estimate>
    <rtc_cm:progressTracking
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/progressTracking"/>
    <rdf:type rdf:resource="http://open-services.net/ns/cm#ChangeRequest"/>
    <oslc:serviceProvider
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/contexts/_9ZDK0OHtEd-OKOGoixAXCg/workitems/services"/>
    <dcterms:creator
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
    <rtc_cm:type
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/types/_9ZDK0OHtEd-OKOGoixAXCg/defect"/>
    <oslc_cmx:project
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/projectareas/_9ZDK0OHtEd-OKOGoixAXCg"/>
    <rtc_cm:state
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workflows/_9ZDK0OHtEd-OKOGoixAXCg/states/bugzillaWorkflow/2"/>
    <rtc_cm:timeSheet
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:timeSheet"/>
    <rtc_ext:contextId>_9ZDK0OHtEd-OKOGoixAXCg</rtc_ext:contextId>
    <oslc:shortTitle rdf:parseType="Literal">Defect 8</oslc:shortTitle>
    <dcterms:identifier>8</dcterms:identifier>
    <dcterms:created>2010-10-27T17:39:30.437Z</dcterms:created>
    <rtc_cm:correctedEstimate></rtc_cm:correctedEstimate>
    <oslc_cmx:severity
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/severity/severity.literal.l3"/>
    <oslc_cm:reviewed>false</oslc_cm:reviewed>
    <rtc_cm:filedAgainst
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemOid/com.ibm.team.workitem.Category/_-cZeNOHtEd-OKOGoixAXCg"/>
    <oslc_cm:status>In Progress</oslc_cm:status>
    <oslc_cm:fixed>false</oslc_cm:fixed>
    <dcterms:subject></dcterms:subject>
    <dcterms:modified>2010-10-28T17:56:32.324Z</dcterms:modified>
    <oslc_cm:approved>false</oslc_cm:approved>
    <rtc_cm:resolvedBy
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_YNh4MOlsEdq4xpiOKg5hvA"/>
    <rtc_cm:modifiedBy
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
    <dcterms:contributor
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/users/_fivmQOHpEd-ts-6cFr6pTw"/>
    <oslc:discussion
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/workitems/_4EDUsOHwEd-OKOGoixAXCg/rtc_cm:comments"/>
    <dcterms:type>Defect</dcterms:type>
    <dcterms:description rdf:parseType="Literal">test defect number
1</dcterms:description>
    <rtc_ext:archived>false</rtc_ext:archived>
    <rtc_cm:com.ibm.team.workitem.linktype.parentworkitem.children
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/resource/itemName/com.ibm.team.workitem.WorkItem/9"/>
    <oslc_cmx:priority
rdf:resource="https://frankb-tp.torolab.ibm.com:9443/ccm/oslc/enumerations/_9ZDK0OHtEd-OKOGoixAXCg/priority/priority.literal.l3"/>
    <oslc_cm:closed>false</oslc_cm:closed>
    <oslc_cm:verified>false</oslc_cm:verified>
    <oslc_cm:inprogress>true</oslc_cm:inprogress>
    <dcterms:title rdf:parseType="Literal">Frank's first defect - modified
via OSLC - and via RTC</dcterms:title>
    <rtc_cm:timeSpent></rtc_cm:timeSpent>
  </rdf:Description>
</rdf:RDF>

>
> >        fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
> >        try {
> >              Model model = fResourceDataset.getNamedModel(resourceURI);
> >              model.read(instream, null);
> >              //model.close();
> >        } finally { fResourceDataset.getLock().leaveCriticalSection() ; }
> >        instream.close();
> >
> > After calling this code about 2-3 thousand times, it starts to run much
> > slower, and then eventually I get an exception like this:
> >
> >        Exception in thread "pool-3-thread-43" java.lang.OutOfMemoryError:
> >        Java heap space
>
> Could you provide a complete minimal example please?  There are some
> details like how fResourceDataset is set that might make a difference.

It might be hard to get a simple example.

fResourceDataset is created like this:

    TDBFactory.createDataset(dirName);

I remove the directory between runs, so it starts with an empty dataset.
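
If it helps in the meantime, here is roughly what a self-contained version
of my loading code would look like (just a sketch; the HTTP fetching and
the thread pool are left out, and DB_DIR is a placeholder):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.shared.Lock;
    import com.hp.hpl.jena.tdb.TDBFactory;

    import java.io.InputStream;

    public class LoadSketch {
        // Placeholder directory; I delete it between runs so each run
        // starts with an empty dataset.
        static final String DB_DIR = "/tmp/tdb-test";

        static final Dataset fResourceDataset = TDBFactory.createDataset(DB_DIR);

        // Called ~100,000 times, once per resource, from a thread pool.
        static void load(String resourceURI, InputStream instream) throws Exception {
            fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
            try {
                Model model = fResourceDataset.getNamedModel(resourceURI);
                model.read(instream, null); // parses RDF/XML by default
            } finally {
                fResourceDataset.getLock().leaveCriticalSection();
            }
            instream.close();
        }
    }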

I also have this initialization in my program:

      static {
            // Configure Jena TDB so that the default data graph in SPARQL
            // queries will be the union of all named graphs.
            // Each resource added to the index will be stored in a separate
            // TDB data graph.
            // The actual default (hidden) data graph will be used to store
            // configuration information for the index.
            TDB.getContext().set(TDB.symUnionDefaultGraph, true);
            TDB.setOptimizerWarningFlag(false); // TODO do we need to provide a BGP optimizer?
      }

Could any of this be causing problems?

>
> The stack trace might be useful as well, although it is not proof of
> exactly where the memory is in use.

It might make more sense for me to try to track this down further myself.
If you can just confirm that you don't see anything wrong with how I'm
using Jena, I'll take it from there.

>
> > I tried increasing the amount of memory, but that just increased the
> > number of calls that succeed (e.g., 10000 vs 2000) before getting the
> > exception.
> >
> > I'm wondering if there's something I need to do to release memory
> > between these calls. I tried putting in a call to model.close(), but
> > all it seemed to do was make it run slower, and I still got the
> > exception.
>
> There isn't anything that should be needed, but I'm wondering if several
> things like entity.getContent are involved in using memory and it's the
> cumulative effect that's a problem.
>
> > Is there something else I should be doing, or is there a possible
> > memory leak in the version of Jena I'm using (a fairly recent
> > SNAPSHOT build)?
> >
> > Btw, I tried commenting out the call to model.read(instream, null) to
> > confirm that the memory leak isn't somewhere else in my program, and
> > that worked - i.e., went through the 100,000 calls without an
> > exception.
> >
> > Any ideas or pointers to what may be wrong would be appreciated.
>
> Another way to do this is to use the bulk loader from the command line.
> It can read from stdin or from a collection of files.
>
> RDF/XML parsing is expensive - N-Triples is fastest.

Is the difference really large? Are there any published numbers showing the
parse and load speeds one can expect from Jena?
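
If it's worth a try, I could pre-convert the RDF/XML payloads to N-Triples
before loading them, along these lines (just a sketch; the file names are
made up):

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class ConvertToNTriples {
        public static void main(String[] args) throws Exception {
            // Parse the RDF/XML payload into an in-memory model...
            Model m = ModelFactory.createDefaultModel();
            m.read(new FileInputStream("payload.rdf"), null); // RDF/XML is the default
            // ...and write it back out as N-Triples for faster re-parsing
            // (e.g., by the command-line bulk loader).
            m.write(new FileOutputStream("payload.nt"), "N-TRIPLE");
        }
    }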

>
>    Andy

Thanks a lot for your help!

Frank.
