Hi Andy,
I did some further analysis of my OutOfMemoryError problem, and this is
what I've discovered. The problem seems to be that there is one instance of
class NodeTupleTableConcrete that holds an ever-growing set of tuples,
which eventually uses up all the available heap space and crashes the JVM.
To be more specific, this field in class TupleTable:
private final TupleIndex[] indexes ;
seems to contain 6 continually growing TupleIndexRecord instances
(BPlusTrees). From my measurements, this seems to eat up approximately 1G
of heap for every 1M triples in the Dataset (i.e., about 1K per triple).
So, to load my 100K datagraphs (~10M total triples) it would seem to need
10G of heap space.
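For what it's worth, here is roughly how I'm estimating those numbers. It's
just a sketch using the standard Runtime API; loadGraph() and graphUris are
placeholders for my actual loading code and the list of graph URIs:

    // Measure approximate heap growth across a batch of graph loads.
    Runtime rt = Runtime.getRuntime();
    long before = rt.totalMemory() - rt.freeMemory();
    for (int i = 0; i < 1000; i++) {
        loadGraph(fResourceDataset, graphUris[i]);   // placeholder for my loading code below
    }
    System.gc();                                     // only a hint, so the figures are rough
    long after = rt.totalMemory() - rt.freeMemory();
    System.out.println("Heap growth for 1000 graphs: " + (after - before) + " bytes");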
Does this make sense? How is it supposed to work? Shouldn't the triples
from previously loaded named graphs be eligible for GC when I'm loading the
next named graph? Could it be that I'm holding onto something that's
preventing GC in the TupleTable?
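In case it helps, this is the change I was planning to try next: close the
per-graph model after reading it, and sync the dataset every so often. The
TDB.sync call and the fGraphCount counter are my own guesses at what might
help, not something I've confirmed is actually needed:

    fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
    try {
        Model model = fResourceDataset.getNamedModel(resourceURI);
        model.read(instream, null);
        model.close();                       // release the per-graph model this time
    } finally {
        fResourceDataset.getLock().leaveCriticalSection();
    }
    instream.close();
    if (++fGraphCount % 1000 == 0) {
        TDB.sync(fResourceDataset);          // flush TDB to disk every 1000 graphs (my guess)
    }

Is that reasonable, or is the sync unnecessary here?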
Also, after looking more carefully at the resources being indexed, I
noticed that many of them do have relatively large literals (hundreds of
characters). I also noticed that when using Fuseki to load the resources I
get lots of warning messages like this on the console:
Lexical form 'We are currently doing this:<br></br><br></br>
workspaceConnection.replaceComponents(replaceComponents, replaceSource, falses, false, monitor);<br></br><br></br>
the new way of doing it would be something like:<br></br><br></br><br></br>
ArrayList<IComponentOp> replaceOps = new ArrayList<IComponentOp>();<br></br>
for (Iterator iComponents = components.iterator(); iComponents.hasNext();) {<br></br>
IComponentHandle componentHandle = (IComponentHandle) iComponents.next();<br></br>
replaceOps.add(promotionTargetConnection.componentOpFactory().replaceComponent(componentHandle,<br></br>
buildWorkspaceConnection, false));<br></br>
}<br></br><br></br>
promotionTargetConnection.applyComponentOperations(replaceOps, monitor);'
not valid for datatype
http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral
Could this be part of the problem?
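If the bad XML literals turn out to matter, I could pre-check the lexical
forms before loading and fall back to plain string literals. This is only a
sketch of what I had in mind, assuming the rdf:XMLLiteral datatype is
registered with Jena's TypeMapper; model, subject, property, and
lexicalForm are placeholders for my own code:

    import com.hp.hpl.jena.datatypes.RDFDatatype;
    import com.hp.hpl.jena.datatypes.TypeMapper;

    // Test the lexical form against the rdf:XMLLiteral datatype before adding it.
    RDFDatatype xmlLiteral = TypeMapper.getInstance()
            .getSafeTypeByName("http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral");
    if (xmlLiteral.isValid(lexicalForm)) {
        model.add(subject, property, model.createTypedLiteral(lexicalForm, xmlLiteral));
    } else {
        // Fall back to a plain string literal rather than an ill-formed XML literal.
        model.add(subject, property, model.createLiteral(lexicalForm));
    }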
Thanks,
Frank.
Andy Seaborne <[email protected]> wrote on 02/18/2011 10:29:52 AM:
>
> >> 10M total I hope :-)
> >
> > Yes, that's the total for this experiment. Would you say that is getting
> > to the upper limit of what's possible?
>
> Nowhere near.
>
> 32 bit isn't great for scaling but it just goes slower, not breaks.
>
> 64 bit and 1 billion is possible if the queries are simple (no data
> mining in SPARQL over 1B triples just yet!) Everything works - but too
> slow.
>
> >>
> >> Is this on a 32 bit machine or a 64 bit machine? Also, which JVM is it?
> >
> > 32 bit machine and standard 1.6 JVM.
> >>
> >> What does the data look like?
> >
> > Pretty standard RDF/XML, ranging in size from 50 - 400 lines of XML.
> > Here's one example:
> >
>
> Looks OK - was just checking it's not full of very, very large literals.
>
> > <rdf:RDF
>
> > </rdf:RDF>
> >
> >>
> >>>     fResourceDataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>     try {
> >>>         Model model = fResourceDataset.getNamedModel(resourceURI);
> >>>         model.read(instream, null);
> >>>         //model.close();
> >>>     } finally { fResourceDataset.getLock().leaveCriticalSection() ; }
> >>>     instream.close();
> >>>
> >>> After calling this code about 2-3 thousand times, it starts to run much
> >>> slower, and then eventually I get an exception like this:
> >>>
> >>> Exception in thread "pool-3-thread-43" java.lang.OutOfMemoryError:
> >>> Java heap space
> >>
> >> Could you provide a complete minimal example please? There are some
> >> details like how fResourceDataset is set that might make a difference.
> >
> > It might be hard to get a simple example.
> >
> > fResourceDataset is created like this:
> >
> > TDBFactory.createDataset(dirName);
> >
> > I remove the directory between runs, so it starts with an empty dataset.
> >
> > I also have this initialization in my program:
> >
> > static {
> >     // Configure Jena TDB so that the default data graph in SPARQL
> >     // queries will be the union of all named graphs.
> >     // Each resource added to the index will be stored in a
> >     // separate TDB data graph.
> >     // The actual default (hidden) data graph will be used to store
> >     // configuration information for the index.
> >     TDB.getContext().set(TDB.symUnionDefaultGraph, true);
>
> Better off for updates just to be simpler but it should not matter.
>
> >     TDB.setOptimizerWarningFlag(false); // TODO do we need to provide a BGP optimizer?
>
> No need for updates.
>
> > }
> >
> > Could any of this be causing problems?
>
> That looks good to me.
>
> >
> >>
> >> The stacktrace might be useful as well, although it is not proof of
> >> exactly where the memory is in use.
> >
> > It might make more sense for me to try to track this down further
> > myself. If you can just confirm that you don't see anything wrong with
> > how I'm using Jena, I'll take it from there.
>
> OK
>
> >> RDF/XML parsing is expensive - N-Triples is fastest.
> >
> > Is the difference really large? Are there any performance numbers
> > available that show Jena performance and load speeds that can be expected?
>
> It's really, really difficult to give useful, honest performance
> numbers. Hardware matters, portables have slow disks, 64 bit is better
> than 32.
>
> But a good workflow is, if getting data from elsewhere, to check it for
> parse errors and bad URIs or literals, then load it. Validation can be
> done by RDF/XML -> N-triples then load the N-triples.
>
> It's in the bulk loader that N-triples makes a big advantage (x2 or more).
>
> Andy
>
> >
> >>
> >> Andy
> >
> > Thanks a lot for your help!
> >
> > Frank.