[jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

Paolo Castagna (Commented) (JIRA) Sat, 03 Mar 2012 08:58:21 -0800

    [ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221641#comment-13221641 ]


Paolo Castagna commented on JENA-117:
-------------------------------------

> I have the following in my tdb2.worldbank.ttl:
>     tdb:location "/usr/lib/fuseki/DB/WorldBank" ; 

What else do you have in your config file? What about the 
unionDefaultGraph setting?

> Almost all of the files have 8388608 bytes. 

That's ok, TDB allocates files in chunks of 8 MB.
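
As a quick check of the arithmetic, 8 MB is 8 * 1024 * 1024 = 8388608 bytes, which matches the file size reported above. The sketch below is only an illustration: the class name and the allocatedBytes helper are hypothetical, and rounding a file up to whole chunks is an assumption, not TDB's actual allocation code.

```java
// Sketch only: the 8 MB chunk size comes from the comment above; rounding up
// to whole chunks is an assumption for illustration, not TDB's actual code.
final class ChunkSizeSketch {
    static final long CHUNK_BYTES = 8L * 1024 * 1024; // 8 MB = 8,388,608 bytes

    // Hypothetical helper: on-disk size if a file only ever grows by whole chunks.
    static long allocatedBytes(long usedBytes) {
        long chunks = (usedBytes + CHUNK_BYTES - 1) / CHUNK_BYTES; // round up
        return chunks * CHUNK_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(CHUNK_BYTES);                     // 8388608
        System.out.println(allocatedBytes(1_000));           // 8388608
        System.out.println(allocatedBytes(CHUNK_BYTES + 1)); // 16777216
    }
}
```

Under that assumption, any file holding less than one chunk of data shows up as exactly 8388608 bytes, which would explain why almost all files in a small store have the same size.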

Here is how I run tdbloader3 on a 1-million-triple fragment of the Open Library dataset:
java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
    -server -d64 -Xmx5120M cmd.tdbloader3 --no-stats --compression \
    --spill-size-auto --loc /tmp/openlibrary \
    /opt/datasets/raw/openlibrary-1000000.nt.gz

The output I see:
INFO  Threshold spill is: 310193029
INFO  Threshold spill is: 310193029
INFO  Threshold spill is: 310193029
INFO  Load: /opt/datasets/raw/openlibrary-1000000.nt.gz -- 2012/03/03 16:46:56 GMT
INFO  Add: 50,000 tuples (Batch: 33,467 / Avg: 33,467)
INFO  Add: 100,000 tuples (Batch: 85,616 / Avg: 48,123)
[...]
INFO  Add: 1,000,000 tuples (Batch: 116,009 / Avg: 77,351)
INFO    Elapsed: 12.93 seconds [2012/03/03 16:47:09 GMT]
INFO  Threshold spill is: 310193029
INFO  Threshold spill is: 310193029
INFO  Node Table (1/3): building nodes.dat and sorting hash|id ...
INFO  Add: 50,000 records for node table (1/3) phase (Batch: 7,743 / Avg: 7,743)
INFO  Add: 100,000 records for node table (1/3) phase (Batch: 245,098 / Avg: 15,012)
[...]
INFO  Add: 3,000,000 records for node table (1/3) phase (Batch: 253,807 / Avg: 141,716)
INFO    Elapsed: 21.17 seconds [2012/03/03 16:47:30 GMT]
INFO  Total: 3,000,000 tuples : 21.17 seconds : 141,709.97 tuples/sec [2012/03/03 16:47:30 GMT]
INFO  Node Table (2/3): generating input data using node ids...
INFO  Add: 50,000 records for node table (2/3) phase (Batch: 10,593 / Avg: 10,593)
INFO  Add: 100,000 records for node table (2/3) phase (Batch: 74,626 / Avg: 18,552)
[...]
INFO  Add: 500,000 records for node table (3/3) phase (Batch: 3,571,428 / Avg: 782,472)
INFO    Elapsed: 0.64 seconds [2012/03/03 16:47:43 GMT]
INFO  Total: 542,639 tuples : 0.89 seconds : 612,459.38 tuples/sec [2012/03/03 16:47:43 GMT]
INFO  Index: creating SPO index...
INFO  Add: 50,000 records to SPO (Batch: 12,813 / Avg: 12,813)
INFO  Add: 100,000 records to SPO (Batch: 1,923,076 / Avg: 25,458)
[...]
INFO  Add: 900,000 records to SPO (Batch: 2,000,000 / Avg: 207,421)
INFO  Total: 927,154 tuples : 4.73 seconds : 195,850.02 tuples/sec [2012/03/03 16:47:48 GMT]
INFO  Index: creating GSPO index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/03 16:47:48 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for POS index...
INFO  Add: 50,000 records to POS (Batch: 649,350 / Avg: 649,350)
INFO  Add: 100,000 records to POS (Batch: 5,000,000 / Avg: 1,149,425)
[...]
INFO  Add: 900,000 records to POS (Batch: 5,555,555 / Avg: 3,734,439)
INFO  Total: 927,154 tuples : 0.25 seconds : 3,768,918.50 tuples/sec [2012/03/03 16:47:48 GMT]
INFO  Index: creating POS index...
INFO  Add: 50,000 records to POS (Batch: 36,873 / Avg: 36,873)
INFO  Add: 100,000 records to POS (Batch: 2,500,000 / Avg: 72,674)
[...]
INFO  Add: 900,000 records to POS (Batch: 2,777,777 / Avg: 537,634)
INFO  Total: 927,154 tuples : 2.03 seconds : 457,402.09 tuples/sec [2012/03/03 16:47:50 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for OSP index...
INFO  Add: 50,000 records to OSP (Batch: 4,166,666 / Avg: 4,166,666)
INFO  Add: 100,000 records to OSP (Batch: 3,846,153 / Avg: 4,000,000)
[...]
INFO  Add: 900,000 records to OSP (Batch: 6,250,000 / Avg: 2,036,199)
INFO  Total: 927,154 tuples : 0.45 seconds : 2,074,170.00 tuples/sec [2012/03/03 16:47:51 GMT]
INFO  Index: creating OSP index...
INFO  Add: 50,000 records to OSP (Batch: 47,438 / Avg: 47,438)
INFO  Add: 100,000 records to OSP (Batch: 3,125,000 / Avg: 93,457)
[...]
INFO  Add: 900,000 records to OSP (Batch: 3,571,428 / Avg: 675,675)
INFO  Total: 927,154 tuples : 1.68 seconds : 553,194.50 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for GPOS index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Index: creating GPOS index...
INFO  Total: 0 tuples : 0.11 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for GOSP index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Index: creating GOSP index...
INFO  Total: 0 tuples : 0.11 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for POSG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Index: creating POSG index...
INFO  Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for OSPG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Index: creating OSPG index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Threshold spill is: 310193029
INFO  Index: sorting data for SPOG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Index: creating SPOG index...
INFO  Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/03 16:47:53 GMT]
INFO  Total: 1,000,000 tuples : 56.84 seconds : 17,593.86 tuples/sec [2012/03/03 16:47:53 GMT]

And, if I try to query the default graph, I get results as expected:
tdbquery --loc /tmp/openlibrary/ "SELECT * {?s ?p ?o} LIMIT 100"

> Given my preferences, do you reckon that tdbloader2 is more suitable? 

Certainly, tdbloader2 is the best (and supported) option we have at the moment 
to load large RDF datasets into TDB.
                
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement for tdbloader2 in 
> replacing the UNIX sort over text files with an external sorting pure Java 
> implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which 
> are wrappers around DataInputStream|DataOutputStream. TupleComparator is 
> trivial.
> Preliminary results seem promising and show that the Java implementation can 
> be faster than UNIX sort, since it uses smaller binary files (instead of text 
> files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag with UNIX sort, is 
> available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is 
> that we could stream results directly into the BPlusTreeRewriter, rather than 
> writing them to disk and then reading them back into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant 
> improvement.
> Using compression for intermediate files might help, but more experiments are 
> necessary to establish if it is worthwhile or not.
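
The external-sort idea in the issue description above can be sketched with nothing but the JDK. This is a minimal illustration, not Jena's SortedDataBag: the class name, the method names, and the use of long[] in place of Tuple<Long> are all assumptions. It spills sorted runs of tuples as raw 8-byte longs through DataOutputStream, then does a k-way merge that compares long values and hands each tuple straight to a consumer, so the sorted result never has to be written out and read back.

```java
import java.io.*;
import java.util.*;
import java.util.function.Consumer;

// Sketch only: this is NOT Jena's SortedDataBag; all names here are hypothetical.
final class ExternalTupleSortSketch {

    // Orders tuples by comparing long values position by position.
    static final Comparator<long[]> TUPLE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    // Pairs the current head tuple of a run with the stream it came from.
    private static final class Head {
        final long[] tuple;
        final DataInputStream in;
        Head(long[] tuple, DataInputStream in) { this.tuple = tuple; this.in = in; }
    }

    // Sorts the buffer in memory and writes it as raw 8-byte longs to a temp file.
    private static File spill(List<long[]> buffer, int width) throws IOException {
        buffer.sort(TUPLE_ORDER);
        File run = File.createTempFile("sort-run-", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(run)))) {
            for (long[] t : buffer) {
                for (int i = 0; i < width; i++) out.writeLong(t[i]);
            }
        }
        buffer.clear();
        return run;
    }

    // Reads the next tuple from a run, or returns null once the run is exhausted.
    private static long[] next(DataInputStream in, int width) throws IOException {
        long[] t = new long[width];
        try {
            for (int i = 0; i < width; i++) t[i] = in.readLong();
        } catch (EOFException end) {
            return null;
        }
        return t;
    }

    // Spills a sorted run every `threshold` tuples, then k-way merges the runs
    // and passes each tuple to `sink` in sorted order.
    static void sort(Iterable<long[]> input, int width, int threshold, Consumer<long[]> sink) {
        List<File> runs = new ArrayList<>();
        List<DataInputStream> streams = new ArrayList<>();
        try {
            List<long[]> buffer = new ArrayList<>();
            for (long[] t : input) {
                buffer.add(t);
                if (buffer.size() >= threshold) runs.add(spill(buffer, width));
            }
            if (!buffer.isEmpty()) runs.add(spill(buffer, width));

            PriorityQueue<Head> heap =
                    new PriorityQueue<>((x, y) -> TUPLE_ORDER.compare(x.tuple, y.tuple));
            for (File run : runs) {
                DataInputStream in = new DataInputStream(
                        new BufferedInputStream(new FileInputStream(run)));
                streams.add(in);
                long[] first = next(in, width);
                if (first != null) heap.add(new Head(first, in));
            }
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                sink.accept(h.tuple);               // stream to the consumer
                long[] t = next(h.in, width);
                if (t != null) heap.add(new Head(t, h.in));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (DataInputStream in : streams) {
                try { in.close(); } catch (IOException ignored) { }
            }
            for (File run : runs) run.delete();
        }
    }

    public static void main(String[] args) {
        List<long[]> input = Arrays.asList(
                new long[]{3, 1, 2}, new long[]{1, 5, 5}, new long[]{2, 0, 0},
                new long[]{1, 2, 9}, new long[]{3, 1, 1});
        // A threshold of 2 forces three spill files for five tuples.
        sort(input, 3, 2, t -> System.out.println(Arrays.toString(t)));
    }
}
```

In a loader, the sink would be whatever builds the index (the BPlusTreeRewriter in the description above); here it only prints. The count-based threshold plays the role that ThresholdPolicyCount has in the quoted snippet, but the real policy classes may behave differently.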

