Hi
I'm trying to load the MusicBrainz N-Triples dump into TDB, from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump
The MusicBrainz dump in N-Triples is BIG!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl
The wc command needs 13 minutes just to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu 12:50,43 total
That means 179 million triples!
I got this stack trace:
Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.overlay(BPTreeNodeMgr.java:194)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$100(BPTreeNodeMgr.java:22)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:141)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
    at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
    at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:133)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:98)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:67)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:108)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:67)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:32)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:39)
    at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:72)
    at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:203)
    at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:186)
    at org.openjena.riot.lang.LangTurtle.emit(LangTurtle.java:52)
    at org.openjena.riot.lang.LangTurtleBase.checkEmitTriple(LangTurtleBase.java:475)
    at org.openjena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:341)
    at org.openjena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:273)
    at org.openjena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:254)
    at org.openjena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:245)
    at org.openjena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:206)
    at org.openjena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:34)
    at org.openjena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:132)
    at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
    at org.openjena.riot.RiotReader.parseTriples(RiotReader.java:85)
tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl 4380,35s user 111,88s system 47% cpu 2:39:01,08 total
I think I already had about 9 million triples from DBpedia in the
database (not sure) before loading the MusicBrainz data.
I kept the tdbloader script's original heap size of 1.2 GB.
This was with TDB 0.8.10.
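Raising the loader's heap could be sketched like this; it assumes the tdbloader wrapper script reads its JVM settings from the JVM_ARGS environment variable (which is where the 1.2 GB default comes from), and the 4 GB figure is an arbitrary choice, not a recommendation:

```shell
# Assumption: the tdbloader shell script honors $JVM_ARGS for JVM heap
# settings; override its 1.2 GB default before running the load.
JVM_ARGS="-Xmx4G"
export JVM_ARGS
echo "JVM_ARGS=$JVM_ARGS"
# tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl
```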
% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
( Debian )
Is the current state of the database corrupted?
Of course I can reload with more memory, but I need to better
understand what TDB does while loading.
Apparently it populates a B+tree in memory while loading.
Does that also happen in normal operation, i.e. for querying?
For loading this dataset, is it just a matter of splitting it into
several pieces before loading?
If so, the tool should do that itself.
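The splitting idea could look like the sketch below; it assumes each N-Triples triple occupies exactly one line (so a line-based split never cuts a triple in half). File names and the chunk size are illustrative, and a tiny stand-in file is used so the commands run as-is:

```shell
# Stand-in for the 179-million-line dump so this runs as-is; in the
# real case the input would be musicbrainz_ngs_dump.rdf.ttl.
printf '<urn:s> <urn:p> "%d" .\n' 1 2 3 4 5 6 > sample.nt

# N-Triples is line-oriented, so splitting on line boundaries is safe.
# Real case: something like  split -l 10000000 -d musicbrainz_ngs_dump.rdf.ttl mbz_
split -l 2 -d sample.nt chunk_

ls chunk_*    # lists chunk_00, chunk_01, chunk_02

# Each chunk would then be loaded into the same location, one run each:
#   for f in chunk_*; do tdbloader --loc ~/tdb_data "$f"; done
```

Whether loading in several tdbloader runs against the same location actually uses less peak memory than one big run is exactly the question above.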
Is there any hope that this dataset will work on TDB?
Have I reached its limits?
PS
There is an unfinished sentence on http://openjena.org/wiki/TDB/Architecture :
"The default storage of each indexes"
--
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://jmvanel.free.fr/ - EulerGUI, a turntable GUI for Semantic Web +
rules, XML, UML, eCore, Java bytecode
+33 (0)6 89 16 29 52
chat : irc://irc.freenode.net#eulergui