Folks

So I've been using tdbloader2 in anger to try to get some exemplar results
for some queries to compare against another implementation for correctness.
However, I have been running into a lot of trouble getting tdbloader2 to run
successfully, and as a result have some questions/suggestions regarding how
we might improve the loader.

The source data is composed of about 68GB of NTriples spread across 21 files
consisting of ~470 million triples, which is perhaps somewhat high for using
TDB.  I tried to build this on an EC2 m3.2xlarge node using the two instance
SSD volumes, one for the database and one for sort temp files.  I can point
people to the data if they want to experiment with it on their own setups.

Firstly, I had a lot of issues just getting through the bulk load phase; on
one system the bulk load phase ran for about a day before I killed it.  It
had slowed to a crawl of about 500 triples per second with 60 million
triples to go, and had been slowing down gradually since about the 280
million triple mark.  The JVM on that system is configured to use an 8G heap.

Secondly, it would be nice if we could have the phases of the bulk load be
incremental steps.  In one case I successfully completed the data phase
(building the node table) but then sort failed on me due to an error about
insufficient temporary disk space.  This, it turns out, is because sort
writes to /tmp by default, which, because I was running on EC2, was on the
tiny startup disk.  In order to proceed with the bulk load I had to hack the
scripts to skip the data phase and proceed with the existing files.

So this raises several issues:

1 - Why are the temporary data files named with an extension derived from
the parent process PID (from $$)?

I assume this is to prevent collisions between concurrent processes, but
since tdbloader2 explicitly refuses to run if the database directory is
non-empty this seems superfluous.  When I hacked the scripts I had to
hard-code the relevant file extensions; just using .tmp as the extension
would avoid this.
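
As a sketch of what a fixed-suffix scheme might look like (the variable
names here are illustrative, not the loader's actual ones):

```shell
# Hypothetical sketch of fixed temp-file names; the real tdbloader2 uses
# PID-based suffixes like data-triples.$$ (names here are illustrative).
DATA_DIR="${DATA_DIR:-/tmp/tdb-demo}"
mkdir -p "$DATA_DIR"
# Fixed .tmp suffixes are safe because the loader refuses to start if the
# database directory is non-empty, so two loads never share a directory.
DATA_TRIPLES="$DATA_DIR/data-triples.tmp"
DATA_QUADS="$DATA_DIR/data-quads.tmp"
echo "$DATA_TRIPLES"
```

With fixed names a failed run leaves predictably named files behind, which
is exactly what makes manual resumption feasible.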

2 - Could the two phases be separated into different scripts so advanced
users could resume builds that fail after tweaking the SORT_ARGS?

It seems like this wouldn't be too hard to do (hey, I just volunteered to
try) and it would make life easier in cases where things do go wrong.
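
For example, a resumable wrapper might expose the phases via an argument
(purely hypothetical; no such option exists in tdbloader2 today, and the
function bodies are stand-ins for the real data/index steps):

```shell
# Hypothetical phase selector for a split loader script.
run_data_phase()  { echo "phase: building node table"; }
run_index_phase() { echo "phase: sorting and building indexes"; }

PHASE="${1:-all}"
case "$PHASE" in
  data)  run_data_phase ;;
  index) run_index_phase ;;   # resume here after tweaking SORT_ARGS
  all)   run_data_phase && run_index_phase ;;
  *)     echo "usage: $0 [data|index|all]" >&2; exit 1 ;;
esac
```

A user whose sort step failed could then fix SORT_ARGS and rerun only the
index phase instead of redoing the (expensive) data phase.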

3 - Could node table building be incremental?

Rather than invoking the node table builder on all files at once, could we
do it one file at a time?  This would hopefully give the JVM a chance to
clean up after each file, and possibly avoid getting stuck in the GC hell
that sometimes seems to result.

Of course, if that command starts a new node table from scratch each time
then this won't work, and I guess it might screw over the statistics
collection?
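
The per-file loop could be as simple as the following sketch, where
build_node_table is a hypothetical stand-in for the real builder invocation,
and the whole idea assumes the builder can append to an existing node table:

```shell
# One builder invocation per input file, so each run's heap/GC state is
# independent. build_node_table is a stand-in name, not a real command.
build_node_table() { echo "loading $1"; }

INPUT_DIR="${INPUT_DIR:-.}"
for f in "$INPUT_DIR"/*.nt; do
  [ -e "$f" ] || continue    # the glob matched nothing; skip
  build_node_table "$f"
done
```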

4 - Could we check the available space on the TMPDIR drive and issue a
warning if we think the free space may be too small?

I'm not sure how we would determine "too small"; presumably you'd find the
available space on the drive on which TMPDIR resides (or the user-specified
directory in SORT_ARGS, if configured), then find the size of the input file
and compare the two.  Depending on the comparison we could then issue a
warning.

If free space is detected as being really low we may also want to bail out
completely with an error.
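
A rough pre-flight check along these lines might look like this (the 2x
warning threshold is a guess, the input path is a stand-in, and POSIX
df -P / du -k output is assumed):

```shell
# Sketch of a free-space pre-flight check for the sort temp directory.
# sort's real temp usage depends on its merge strategy, so 2x is a guess.
TMPDIR="${TMPDIR:-/tmp}"
INPUT="${1:-/etc/hosts}"    # stand-in input file for illustration

free_kb=$(df -Pk "$TMPDIR" | awk 'NR==2 {print $4}')
input_kb=$(du -k "$INPUT" | awk '{print $1}')

if [ "$free_kb" -lt "$input_kb" ]; then
  echo "ERROR: only ${free_kb}KB free in $TMPDIR for a ${input_kb}KB input" >&2
  exit 2
elif [ "$free_kb" -lt $((input_kb * 2)) ]; then
  echo "WARN: free space in $TMPDIR may be too small for sort temp files" >&2
fi
```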

5 - No progress reporting during sort

There is no progress reporting during sort, which means the only way to
monitor progress is to look at top and check that something is still
running.

When running in the foreground, something like pv (where available) would
provide useful progress information, especially since we know the size of
the input file and the output file should be the same size.
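
For instance, the sort invocation could be wrapped like this (file names are
illustrative, and it falls back to plain cat where pv is absent):

```shell
# Sketch: use pv for progress when available; pv can report percentage
# complete because it knows the input file's size up front.
if command -v pv >/dev/null 2>&1; then
  READER="pv"
else
  READER="cat"
fi
echo "reading input with: $READER"
# The actual pipeline would then be something like (illustrative):
#   "$READER" data-triples.tmp | sort $SORT_ARGS > sorted-triples.tmp
```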

I'll start playing with this; if people think it is worth exploring I'll
file a proper JIRA for it.

Rob
