Hi
I'm trying to load the MusicBrainz N-Triples dump into TDB, from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump
The MusicBrainz dump in N-Triples is BIG!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl
The wc command needs 13 minutes just to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu 12:50,43 total
That means 179 million triples!
I got this stack trace:
Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.overlay(BPTreeNodeMgr.java:194)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$100(BPTreeNodeMgr.java:22)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:141)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:133)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:98)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:67)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:108)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:67)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:32)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:39)
    at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:72)
    at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:203)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:186)
    at org.openjena.riot.lang.LangTurtle.emit(LangTurtle.java:52)
    at org.openjena.riot.lang.LangTurtleBase.checkEmitTriple(LangTurtleBase.java:475)
    at org.openjena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:341)
    at org.openjena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:273)
    at org.openjena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:254)
    at org.openjena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:245)
    at org.openjena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:206)
    at org.openjena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:34)
    at org.openjena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:132)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
    at org.openjena.riot.RiotReader.parseTriples(RiotReader.java:85)
tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl 4380,35s user 111,88s system 47% cpu 2:39:01,08 total
I think I already had about 9 million triples from DBpedia in the
database (not sure) before loading the MusicBrainz data.
I kept the tdbloader script's original heap size of 1.2 GB.
This was with TDB 0.8.10.
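Raising the loader's heap could be sketched like this; it assumes the tdbloader wrapper script reads its JVM settings from the JVM_ARGS environment variable (which is where the 1.2 GB default comes from), and the 4 GB figure is an arbitrary choice, not a recommendation:

```shell
# Assumption: the tdbloader shell script honors $JVM_ARGS for JVM heap
# settings; override its 1.2 GB default before running the load.
JVM_ARGS="-Xmx4G"
export JVM_ARGS
echo "JVM_ARGS=$JVM_ARGS"
# tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl
```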
% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
( Debian )
Is the current state of the database corrupted?
Of course I can reload with more memory, but I need to better
understand what TDB does while loading.
Apparently it populates a B+tree in memory while loading.
Does that also happen in normal operation, i.e. for querying?
For loading this dataset, is it just a matter of splitting it into
several pieces before loading?
If so, the tool should do that itself.
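The splitting idea could look like the sketch below; it assumes each N-Triples triple occupies exactly one line (so a line-based split never cuts a triple in half). File names and the chunk size are illustrative, and a tiny stand-in file is used so the commands run as-is:

```shell
# Stand-in for the 179-million-line dump so this runs as-is; in the
# real case the input would be musicbrainz_ngs_dump.rdf.ttl.
printf '<urn:s> <urn:p> "%d" .\n' 1 2 3 4 5 6 > sample.nt

# N-Triples is line-oriented, so splitting on line boundaries is safe.
# Real case: something like  split -l 10000000 -d musicbrainz_ngs_dump.rdf.ttl mbz_
split -l 2 -d sample.nt chunk_

ls chunk_*    # lists chunk_00, chunk_01, chunk_02

# Each chunk would then be loaded into the same location, one run each:
#   for f in chunk_*; do tdbloader --loc ~/tdb_data "$f"; done
```

Whether loading in several tdbloader runs against the same location actually uses less peak memory than one big run is exactly the question above.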
Is there any hope that this dataset will work on TDB?
Have I reached its limits?
PS
There is an unfinished sentence on http://openjena.org/wiki/TDB/Architecture :
"The default storage of each indexes"
--
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://jmvanel.free.fr/ - EulerGUI, a turntable GUI for Semantic Web +
rules, XML, UML, eCore, Java bytecode
+33 (0)6 89 16 29 52
chat : irc://irc.freenode.net#eulergui