Paolo Castagna wrote:
>> With tdbloader2 I had a java.lang.OutOfMemoryError:
>> [...]
>> I'll try giving the JVM more RAM.
>
> I tried with -Xmx2048m, but I had the same problem.
> I'll try with -Xmx4096m.

This time, UNIX sort filled /tmp... I'll try specifying --temporary-directory=DIR or, better, setting the $TMPDIR environment variable (this way there is no need to change the tdbloader2 script).

>> tdbloader3 ran out of disk space (because it is writing temporary files
>> in /tmp and the available instance disk space is mounted on /mnt :-()
>> I'll see how to change/fix this and re-run.
>
> This ran almost to completion this time, but I was using the --spill-size-auto
> policy, which clearly needs improvement.
> [...]
>
> I'll try with a fixed --spill-size 10000000.

This time, I was able to load the Freebase data dump (converted into RDF) using tdbloader3.

This is how I ran tdbloader3 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

  java -Djava.io.tmpdir=/mnt/data/tmp \
       -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
       -server -d64 -Xmx12288M \
       cmd.tdbloader3 --no-stats --compression --spill-size 10000000 \
       --loc /mnt/data/freebase \
       /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:

  Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec

This is the log:

Mar 6 11:43:59 ip-10-53-130-32 build: INFO Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/06 11:43:59 UTC
Mar 6 11:44:00 ip-10-53-130-32 build: INFO Add: 50,000 tuples (Batch: 35,335 / Avg: 35,335)
Mar 6 11:44:01 ip-10-53-130-32 build: INFO Add: 100,000 tuples (Batch: 68,212 / Avg: 46,554)
[...]
Mar 6 15:32:38 ip-10-53-130-32 build: INFO Add: 618,450,000 tuples (Batch: 89,766 / Avg: 45,079)
Mar 6 15:32:38 ip-10-53-130-32 build: INFO Node Table (1/3): building nodes.dat and sorting hash|id ...
Mar 6 17:24:46 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (1/3) phase (Batch: 7 / Avg: 7)
Mar 6 17:24:47 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (1/3) phase (Batch: 82,236 / Avg: 14)
[...]
Mar 6 21:23:09 ip-10-53-130-32 build: INFO Add: 1,855,350,000 records for node table (1/3) phase (Batch: 216,450 / Avg: 88,220)
Mar 6 21:23:09 ip-10-53-130-32 build: INFO Total: 1,855,395,837 tuples : 21,031.01 seconds : 88,221.91 tuples/sec [2012/03/06 21:23:09 UTC]
Mar 6 21:23:40 ip-10-53-130-32 build: INFO Node Table (2/3): generating input data using node ids...
Mar 6 23:00:17 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (2/3) phase (Batch: 8 / Avg: 8)
Mar 6 23:00:17 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (2/3) phase (Batch: 96,899 / Avg: 17)
[...]
Mar 7 01:04:18 ip-10-53-130-32 build: INFO Add: 618,450,000 records for node table (2/3) phase (Batch: 95,969 / Avg: 46,718)
Mar 7 01:04:18 ip-10-53-130-32 build: INFO Total: 618,463,448 tuples : 13,237.97 seconds : 46,718.90 tuples/sec [2012/03/07 01:04:18 UTC]
Mar 7 01:04:23 ip-10-53-130-32 build: INFO Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
Mar 7 01:04:38 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (3/3) phase (Batch: 3,511 / Avg: 3,511)
Mar 7 01:04:38 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (3/3) phase (Batch: 375,939 / Avg: 6,958)
[...]
Mar 7 01:07:21 ip-10-53-130-32 build: INFO Add: 149,050,000 records for node table (3/3) phase (Batch: 980,392 / Avg: 838,537)
Mar 7 01:07:24 ip-10-53-130-32 build: INFO Total: 149,066,002 tuples : 180.42 seconds : 826,225.75 tuples/sec [2012/03/07 01:07:24 UTC]
Mar 7 01:07:27 ip-10-53-130-32 build: INFO Index: creating SPO index...
Mar 7 01:08:14 ip-10-53-130-32 build: INFO Add: 50,000 records to SPO (Batch: 1,065 / Avg: 1,065)
Mar 7 01:08:15 ip-10-53-130-32 build: INFO Add: 100,000 records to SPO (Batch: 54,764 / Avg: 2,090)
[...]
Mar 7 01:18:47 ip-10-53-130-32 build: INFO Add: 618,450,000 records to SPO (Batch: 1,020,408 / Avg: 908,977)
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 682.99 seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Index: creating GSPO index...
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.12 seconds : 0.00 tuples/sec [2012/03/07 01:18:50 UTC]
Mar 7 01:18:56 ip-10-53-130-32 build: INFO Index: sorting data for POS index...
Mar 7 01:18:57 ip-10-53-130-32 build: INFO Add: 50,000 records to POS (Batch: 210,084 / Avg: 210,084)
Mar 7 01:18:57 ip-10-53-130-32 build: INFO Add: 100,000 records to POS (Batch: 1,724,137 / Avg: 374,531)
[...]
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Add: 618,450,000 records to POS (Batch: 4,545,454 / Avg: 366,790)
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 1,686.18 seconds : 366,783.97 tuples/sec [2012/03/07 01:47:03 UTC]
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Index: creating POS index...
Mar 7 01:47:41 ip-10-53-130-32 build: INFO Add: 50,000 records to POS (Batch: 1,321 / Avg: 1,321)
Mar 7 01:47:41 ip-10-53-130-32 build: INFO Add: 100,000 records to POS (Batch: 1,086,956 / Avg: 2,639)
[...]
Mar 7 01:57:37 ip-10-53-130-32 build: INFO Add: 618,450,000 records to POS (Batch: 1,162,790 / Avg: 974,417)
Mar 7 01:57:42 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 638.92 seconds : 967,976.50 tuples/sec [2012/03/07 01:57:42 UTC]
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Index: sorting data for OSP index...
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Add: 50,000 records to OSP (Batch: 373,134 / Avg: 373,134)
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Add: 100,000 records to OSP (Batch: 549,450 / Avg: 444,444)
[...]
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Add: 618,450,000 records to OSP (Batch: 4,166,666 / Avg: 360,257)
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 1,716.69 seconds : 360,264.44 tuples/sec [2012/03/07 02:26:23 UTC]
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Index: creating OSP index...
Mar 7 02:27:02 ip-10-53-130-32 build: INFO Add: 50,000 records to OSP (Batch: 1,284 / Avg: 1,284)
Mar 7 02:27:03 ip-10-53-130-32 build: INFO Add: 100,000 records to OSP (Batch: 364,963 / Avg: 2,560)
[...]
Mar 7 02:37:18 ip-10-53-130-32 build: INFO Add: 618,450,000 records to OSP (Batch: 1,020,408 / Avg: 944,877)
Mar 7 02:37:22 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 658.94 seconds : 938,578.94 tuples/sec [2012/03/07 02:37:22 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for GPOS index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.03 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating GPOS index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for GOSP index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating GOSP index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for POSG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating POSG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for OSPG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating OSPG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for SPOG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating SPOG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]

Paolo
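
P.S. For anyone who wants to compare phases from a log like the one above, here is a minimal sketch (not part of tdbloader3) that pulls the "Total:" summary lines out of the log and converts the elapsed seconds into hours. The function name parse_totals and the regular expression are illustrative assumptions based only on the line format shown in this message; the two sample lines are copied from the log.

```python
import re

# Matches the summary lines printed at the end of each phase, e.g.
# "Total: 618,463,449 tuples : 682.99 seconds : 905,528.69 tuples/sec"
# (pattern inferred from the log above, not from the tdbloader3 source).
TOTAL_RE = re.compile(
    r"Total: ([\d,]+) tuples : ([\d,.]+) seconds : ([\d,.]+) tuples/sec"
)


def parse_totals(lines):
    """Yield (tuples, seconds, reported_rate) for every 'Total:' line."""
    for line in lines:
        m = TOTAL_RE.search(line)
        if m:
            tuples = int(m.group(1).replace(",", ""))
            seconds = float(m.group(2).replace(",", ""))
            rate = float(m.group(3).replace(",", ""))
            yield tuples, seconds, rate


# Two lines copied from the log: the SPO index phase and the overall total.
sample = [
    "Mar 7 01:18:50 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : "
    "682.99 seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]",
    "Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 618,465,279 tuples : "
    "53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]",
]

for tuples, seconds, rate in parse_totals(sample):
    hours = seconds / 3600
    print(f"{tuples:>13,} tuples in {hours:6.2f} h ({rate:,.2f} tuples/sec reported)")
```

Feeding the whole log file to parse_totals (for example via open(path)) should give one row per phase, which makes it easier to see where the time goes.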