Folks,

So I've been using tdbloader2 in anger, trying to get some exemplar results for some queries to compare against another implementation for correctness. However, I have been running into a lot of trouble getting tdbloader2 to run successfully, and as a result I have some questions/suggestions about how we might improve the loader.
The source data is about 68GB of NTriples spread across 21 files, ~470 million triples in total, which is perhaps somewhat high for TDB. I tried to build this on an EC2 m3.2xlarge node using the two instance SSD volumes, one for the database and one for sort temp files. I can point people to the data if they want to experiment with it on their own setups.

Firstly, I had a lot of issues just getting through the bulk load phase. On one system the bulk load ran for about a day before I killed it; it had slowed to a crawl of about 500 triples per second with 60 million triples to go, and had been slowing down gradually since about the 280 million triple mark. The JVM on that system is configured to use an 8G heap.

Secondly, it would be nice if the phases of the bulk load could be run as incremental steps. In one case I successfully completed the data phase (building the node table) but then the sort failed on me with an error about insufficient temporary disk space. This, it turns out, is because sort writes to /tmp by default, which because I was running on EC2 was the tiny startup disk. In order to proceed with the bulk load I had to hack the scripts to skip the data phase and carry on with the existing files.

This raises several issues:

1 - Why are the temporary data files named with an extension of the parent process PID (from $$)? I assume this is to prevent collisions between processes, but since tdbloader2 explicitly refuses to run if the database directory is non-empty this seems superfluous. When I hacked the scripts I had to hard-code the relevant file extensions; just using .tmp as the extension would avoid this.

2 - Could the two phases be separated into different scripts so advanced users could resume builds that fail after tweaking SORT_ARGS? It seems like this wouldn't be too hard to do (hey, I just volunteered to try) and it would make life easier in cases where things do go wrong.

3 - Could node table building be incremental? Rather than invoking the node table builder on all files at once, could it be run one file at a time? This would hopefully give the JVM a chance to clean up after each file and possibly avoid getting stuck in the GC hell that sometimes results. Of course, if that command starts a new node table from scratch each time then this won't work, and I guess it might screw over the statistics collection.

4 - Could we check the available space on the drive holding TMPDIR and issue a warning if we think the free space may be too small? I'm not sure how we would determine "too small"; presumably you'd find the available space on the drive on which TMPDIR resides (or the user-specified directory in SORT_ARGS, if configured), find the size of the input file, and compare the two, possibly issuing a warning depending on the comparison. If free space is detected as being really low we may also want to bail out completely with an error.

5 - There is no progress reporting during the sort, which means the only way to monitor progress is to look at top and check that something is still running. When running in the foreground, something like pv (where available) would provide useful progress information, especially since we know the size of the input file and the output file should be the same size.

I'll start playing with this; if people think it's worth exploring I'll file a proper JIRA for it.

Rob
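To make points 4 and 5 concrete, here is a rough sketch of what the free-space check and the pv-based progress reporting could look like. This is not taken from tdbloader2 itself: the file names, the sample data, the "abort below 1x input / warn below 2x input" thresholds, and the bare sort invocation are all placeholders I've made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the checks proposed in points 4 and 5 above.
# DATA_FILE, sorted.tmp and the thresholds are hypothetical stand-ins.

# Tiny sample input so the sketch is runnable as-is.
DATA_FILE="sample.nt"
printf '<s> <p> "b" .\n<s> <p> "a" .\n' > "$DATA_FILE"

TMPDIR="${TMPDIR:-/tmp}"

# Point 4: warn (or abort) if the temp volume looks too small for the sort.
input_kb=$(du -k "$DATA_FILE" | cut -f1)
free_kb=$(df -Pk "$TMPDIR" | awk 'NR==2 { print $4 }')

if [ "$free_kb" -lt "$input_kb" ]; then
  echo "ERROR: $TMPDIR has ${free_kb}KB free but the input is ${input_kb}KB" >&2
  exit 1
elif [ "$free_kb" -lt $(( input_kb * 2 )) ]; then
  echo "WARN: $TMPDIR may be too small for sort temporary files" >&2
fi

# Point 5: pipe the input through pv for progress reporting when available;
# pv knows the file size, so it can show percentage, rate and ETA.
if command -v pv > /dev/null 2>&1; then
  pv "$DATA_FILE" | LC_ALL=C sort ${SORT_ARGS:-} > sorted.tmp
else
  LC_ALL=C sort ${SORT_ARGS:-} "$DATA_FILE" > sorted.tmp
fi
```

The df -P flag forces the POSIX single-line output format so the awk column index is stable across platforms; the real script would presumably reuse whatever SORT_ARGS handling tdbloader2 already has rather than the bare expansion shown here.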
