Hi Frank,

On 28/02/11 14:48, Frank Budinsky wrote:

Hi Andy,

I did some further analysis of my OutOfMemoryError problem, and this is
what I've discovered. The problem seems to be that there is one instance of
class NodeTupleTableConcrete that contains an ever growing set of tuples
which eventually uses up all the available heap space and then crashes.

To be more specific, this field in class TupleTable:

     private final TupleIndex[] indexes ;

seems to contain 6 continually growing TupleIndexRecord instances
(BPlusTrees). From my measurements, this seems to eat up approximately 1G
of heap for every 1M triples in the Dataset (i.e., about 1K per triple).
So, to load my 100K datagraphs (~10M total triples) it would seem to need
10G of heap space.

There are 6 indexes for named graphs (see the files GSPO etc). TDB uses total indexing, which puts a lot of work at load time but means any lookup needed is always done with an index scan. The code can run with fewer indexes - the minimum is one - but that is not exposed in the configuration.

Each index holds quads (4 NodeIds, a NodeId is 64 bits on disk). As the index grows the data goes to disk. There is a finite LRU cache in front of each index.

Does your dataset have a location? If it has no location, it's all in-memory with a RAM-disk-like structure. This is for small-scale testing only - it really does read and write blocks out of the RAM disk by copy to give strict disk-like semantics.
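
If you give the dataset a disk location, creating it looks roughly like this (a minimal sketch using the standard TDBFactory entry point; "DB1" is just an example directory name):

    import com.hp.hpl.jena.query.Dataset ;
    import com.hp.hpl.jena.tdb.TDBFactory ;

    // Disk-backed: indexes and node table live in files under DB1, so heap
    // use is bounded by the caches rather than by the total data size.
    Dataset onDisk = TDBFactory.createDataset("DB1") ;

    // No location: everything stays in the RAM-disk-like structure -
    // small-scale testing only.
    Dataset inMemory = TDBFactory.createDataset() ;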

There is also a NodeTable mapping between NodeId and Node (Jena's graph-level RDF Term class). This has a cache in front of it.
The long-ish literals may be the problem. The node table cache holds a fixed number of entries - it is not bounded by size.

The sizes of the caches are controlled by:

SystemTDB.Node2NodeIdCacheSize
SystemTDB.NodeId2NodeCacheSize

These are not easy to control but either (1) get the source code and alter the default values, or (2) see the code in SystemTDB that reads a properties file.
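
To see what defaults you are actually running with, something like this should do it (a quick sketch - it assumes those constants are visible as public fields on com.hp.hpl.jena.tdb.sys.SystemTDB; the property names used by the properties-file route are in that class's source):

    import com.hp.hpl.jena.tdb.sys.SystemTDB ;

    // Default cache sizes (number of entries, not bytes) for the
    // Node <-> NodeId caches in front of the node table.
    System.out.println("Node2NodeIdCacheSize = " + SystemTDB.Node2NodeIdCacheSize) ;
    System.out.println("NodeId2NodeCacheSize = " + SystemTDB.NodeId2NodeCacheSize) ;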

If you can send me a copy of the data, I can try loading it here.

Does this make sense? How is it supposed to work? Shouldn't the triples
from previously loaded named graphs be eligible for GC when I'm loading the
next named graph? Could it be that I'm holding onto something that's
preventing GC in the TupleTable?

Also, after looking more carefully at the resources being indexed, I
noticed that many of them do have relatively large literals (100s of
characters). I also noticed that when using Fuseki to load the resources I
get lots of warning messages like this, on the console:

    Lexical form 'We are currently doing
this:<br></br><br></br>workspaceConnection.replaceComponents
(replaceComponents, replaceSource, falses, false,
monitor);<br></br><br></br>the new way of doing it would be something
like:<br></br><br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
ArrayList&lt;IComponentOp&gt; replaceOps = new
ArrayList&lt;IComponentOp&gt;();<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  for (Iterator iComponents = components.iterator(); iComponents.hasNext();)
{<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  IComponentHandle componentHandle = (IComponentHandle) iComponents.next
();<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  replaceOps.add(promotionTargetConnection.componentOpFactory
().replaceComponent
(componentHandle,<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  buildWorkspaceConnection,
false));<br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
}<br></br><br></br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  promotionTargetConnection.applyComponentOperations(replaceOps, monitor);'
not valid for datatype
http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral

Could this be part of the problem?

No - it's a different issue.  This is something coming from the parser.

RDF XMLLiterals have special rules - they must follow exclusive canonical XML, which means, amongst a lot of other things, they have to be a single XML node. The rules for exclusive canonical XML are really quite strict (e.g. attributes in alphabetical order).

http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral

If you want to store XML or HTML fragments, you can't use RDF XMLLiterals very easily - you have to mangle them to conform to the rules. I suggest storing them either as plain strings or inventing your own datatype.
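
For example, as a plain string literal it is just (a sketch - the subject and property URIs here are made up for illustration):

    import com.hp.hpl.jena.rdf.model.* ;

    Model model = ModelFactory.createDefaultModel() ;

    String htmlFragment = "workspaceConnection.replaceComponents(...)<br></br>" ;

    Resource subject = model.createResource("http://example.org/resource/1") ;
    Property comment = model.createProperty("http://example.org/ns#comment") ;

    // Plain string literal: no exclusive canonical XML rules apply, so the
    // fragment is stored exactly as given.
    model.add(subject, comment, model.createLiteral(htmlFragment)) ;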

You can run the parser on its own using
"riotcmd.riot --validate FILE ..."


        Andy


Thanks,
Frank.
