[ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221641#comment-13221641 ]
Paolo Castagna commented on JENA-117:
-------------------------------------

> I have the following in my tdb2.worldbank.ttl:
> tdb:location "/usr/lib/fuseki/DB/WorldBank" ;

What else do you have in your config file?
What about the unionDefaultGraph setting?

> Almost all of the files have 8388608 bytes.

That's ok, TDB allocates files in chunks of 8 MB.

Here is how I run tdbloader3 with a 1 million triple fragment of the Open Library dataset:

java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx5120M cmd.tdbloader3 --no-stats --compression --spill-size-auto --loc /tmp/openlibrary /opt/datasets/raw/openlibrary-1000000.nt.gz

The output I see:

INFO Threshold spill is: 310193029
INFO Threshold spill is: 310193029
INFO Threshold spill is: 310193029
INFO Load: /opt/datasets/raw/openlibrary-1000000.nt.gz -- 2012/03/03 16:46:56 GMT
INFO Add: 50,000 tuples (Batch: 33,467 / Avg: 33,467)
INFO Add: 100,000 tuples (Batch: 85,616 / Avg: 48,123)
[...]
INFO Add: 1,000,000 tuples (Batch: 116,009 / Avg: 77,351)
INFO Elapsed: 12.93 seconds [2012/03/03 16:47:09 GMT]
INFO Threshold spill is: 310193029
INFO Threshold spill is: 310193029
INFO Node Table (1/3): building nodes.dat and sorting hash|id ...
INFO Add: 50,000 records for node table (1/3) phase (Batch: 7,743 / Avg: 7,743)
INFO Add: 100,000 records for node table (1/3) phase (Batch: 245,098 / Avg: 15,012)
[...]
INFO Add: 3,000,000 records for node table (1/3) phase (Batch: 253,807 / Avg: 141,716)
INFO Elapsed: 21.17 seconds [2012/03/03 16:47:30 GMT]
INFO Total: 3,000,000 tuples : 21.17 seconds : 141,709.97 tuples/sec [2012/03/03 16:47:30 GMT]
INFO Node Table (2/3): generating input data using node ids...
INFO Add: 50,000 records for node table (2/3) phase (Batch: 10,593 / Avg: 10,593)
INFO Add: 100,000 records for node table (2/3) phase (Batch: 74,626 / Avg: 18,552)
[...]
INFO Add: 500,000 records for node table (3/3) phase (Batch: 3,571,428 / Avg: 782,472)
INFO Elapsed: 0.64 seconds [2012/03/03 16:47:43 GMT]
INFO Total: 542,639 tuples : 0.89 seconds : 612,459.38 tuples/sec [2012/03/03 16:47:43 GMT]
INFO Index: creating SPO index...
INFO Add: 50,000 records to SPO (Batch: 12,813 / Avg: 12,813)
INFO Add: 100,000 records to SPO (Batch: 1,923,076 / Avg: 25,458)
[...]
INFO Add: 900,000 records to SPO (Batch: 2,000,000 / Avg: 207,421)
INFO Total: 927,154 tuples : 4.73 seconds : 195,850.02 tuples/sec [2012/03/03 16:47:48 GMT]
INFO Index: creating GSPO index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/03 16:47:48 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for POS index...
INFO Add: 50,000 records to POS (Batch: 649,350 / Avg: 649,350)
INFO Add: 100,000 records to POS (Batch: 5,000,000 / Avg: 1,149,425)
[...]
INFO Add: 900,000 records to POS (Batch: 5,555,555 / Avg: 3,734,439)
INFO Total: 927,154 tuples : 0.25 seconds : 3,768,918.50 tuples/sec [2012/03/03 16:47:48 GMT]
INFO Index: creating POS index...
INFO Add: 50,000 records to POS (Batch: 36,873 / Avg: 36,873)
INFO Add: 100,000 records to POS (Batch: 2,500,000 / Avg: 72,674)
[...]
INFO Add: 900,000 records to POS (Batch: 2,777,777 / Avg: 537,634)
INFO Total: 927,154 tuples : 2.03 seconds : 457,402.09 tuples/sec [2012/03/03 16:47:50 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for OSP index...
INFO Add: 50,000 records to OSP (Batch: 4,166,666 / Avg: 4,166,666)
INFO Add: 100,000 records to OSP (Batch: 3,846,153 / Avg: 4,000,000)
[...]
INFO Add: 900,000 records to OSP (Batch: 6,250,000 / Avg: 2,036,199)
INFO Total: 927,154 tuples : 0.45 seconds : 2,074,170.00 tuples/sec [2012/03/03 16:47:51 GMT]
INFO Index: creating OSP index...
INFO Add: 50,000 records to OSP (Batch: 47,438 / Avg: 47,438)
INFO Add: 100,000 records to OSP (Batch: 3,125,000 / Avg: 93,457)
[...]
INFO Add: 900,000 records to OSP (Batch: 3,571,428 / Avg: 675,675)
INFO Total: 927,154 tuples : 1.68 seconds : 553,194.50 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for GPOS index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Index: creating GPOS index...
INFO Total: 0 tuples : 0.11 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for GOSP index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Index: creating GOSP index...
INFO Total: 0 tuples : 0.11 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for POSG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Index: creating POSG index...
INFO Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for OSPG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Index: creating OSPG index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Threshold spill is: 310193029
INFO Index: sorting data for SPOG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Index: creating SPOG index...
INFO Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO Total: 1,000,000 tuples : 56.84 seconds : 17,593.86 tuples/sec [2012/03/03 16:47:53 GMT]

And, if I try to query the default graph, I get results as expected:

tdbquery --loc /tmp/openlibrary/ "SELECT * {?s ?p ?o} LIMIT 100"

> Given my preferences, do you reckon that tdbloader2 is more suitable?

Certainly, tdbloader2 is the best (and supported) option we have at the moment to load large RDF datasets into TDB.

> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
> Key: JENA-117
> URL: https://issues.apache.org/jira/browse/JENA-117
> Project: Apache Jena
> Issue Type: Improvement
> Components: TDB
> Reporter: Paolo Castagna
> Assignee: Paolo Castagna
> Priority: Minor
> Labels: performance, tdbloader2
> Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in
> replacing the UNIX sort over text files with an external sorting pure Java
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
> ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream which
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is
> trivial.
> Preliminary results seem promising and show that the Java implementation can
> be faster than UNIX sort since it uses smaller binary files (instead of text
> files) and it does comparisons of long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage in doing the sorting with Java rather than UNIX sort is
> that we could stream results directly into the BPlusTreeRewriter rather than
> writing them to disk and then reading them back from disk into the
> BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant
> improvement.
> Using compression for intermediate files might help, but more experiments are
> necessary to establish if it is worthwhile or not.
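
For illustration, the external sorting idea in the quoted description (spill sorted runs of long tuples to small binary files, then merge them and stream the result onwards) could be sketched roughly as below. This is not the SortedDataBag code from JENA-99 and not what tdbloader3 actually does: the class name ExternalSortSketch, the fixed tuple width of 3, the temp-file naming and the Consumer used as a stand-in for the BPlusTreeRewriter input are all assumptions made for the example.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.Consumer;

/** Hypothetical sketch (not Jena code): external merge sort of fixed-width long tuples. */
public class ExternalSortSketch {
    static final int WIDTH = 3; // assumed tuple width, e.g. S, P, O node ids

    // Compare tuples slot by slot as long values; no string comparison is involved.
    static final Comparator<long[]> CMP = (a, b) -> {
        for (int i = 0; i < WIDTH; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    };

    /** Reads tuples, spills a sorted run every `threshold` tuples, then merges all runs into `sink`. */
    public static void sort(Iterator<long[]> input, int threshold, Consumer<long[]> sink) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<long[]> buffer = new ArrayList<>();
        while (input.hasNext()) {
            buffer.add(input.next());
            if (buffer.size() >= threshold) runs.add(spill(buffer));
        }
        // For simplicity the last partial buffer is spilled too, rather than merged from memory.
        if (!buffer.isEmpty()) runs.add(spill(buffer));
        merge(runs, sink);
    }

    // Sort the in-memory buffer, write it as raw longs to a temp file, and empty the buffer.
    static Path spill(List<long[]> buffer) throws IOException {
        buffer.sort(CMP);
        Path run = Files.createTempFile("run-", ".bin");
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (long[] tuple : buffer)
                for (long value : tuple) out.writeLong(value);
        }
        buffer.clear();
        return run;
    }

    // One cursor per run: the stream plus the tuple currently at its head (null once exhausted).
    static final class Cursor {
        final DataInputStream in;
        long[] head;

        Cursor(Path run) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(run)));
            advance();
        }

        void advance() throws IOException {
            try {
                long[] tuple = new long[WIDTH];
                for (int i = 0; i < WIDTH; i++) tuple[i] = in.readLong();
                head = tuple;
            } catch (EOFException end) {
                head = null;
                in.close();
            }
        }
    }

    // K-way merge: repeatedly emit the smallest head among all runs.
    static void merge(List<Path> runs, Consumer<long[]> sink) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((x, y) -> CMP.compare(x.head, y.head));
        for (Path run : runs) {
            Cursor cursor = new Cursor(run);
            if (cursor.head != null) heap.add(cursor);
        }
        while (!heap.isEmpty()) {
            Cursor cursor = heap.poll();
            sink.accept(cursor.head);
            cursor.advance();
            if (cursor.head != null) heap.add(cursor);
        }
        for (Path run : runs) Files.delete(run);
    }

    public static void main(String[] args) throws IOException {
        List<long[]> tuples = Arrays.asList(new long[]{3, 1, 2}, new long[]{1, 2, 3}, new long[]{2, 9, 9});
        // A threshold of 2 forces two runs, so the merge step is exercised.
        sort(tuples.iterator(), 2, t -> System.out.println(Arrays.toString(t)));
    }
}
```

Passing a sink instead of returning a list mirrors the point made in the description: the merged tuples can be handed straight to whatever builds the index, without a further round trip to disk.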