On 25/06/15 15:57, Rob Vesse wrote:
> Folks
>
> So I've been using tdbloader2 in anger to try and get some exemplar
> results for some queries to compare against another implementation for
> correctness.  However I have been running into a lot of trouble getting
> tdbloader2 to run successfully and as a result have some
> questions/suggestions regarding how we might improve the loader
>
> The source data is composed of about 68GB of NTriples spread across 21
> files consisting of ~470 million triples which is perhaps somewhat high
> for using TDB.  I tried to build this on an EC2 m3.2xlarge node using the
> two instance SSD volumes, one for the database and one for sort temp
> files.  I can point people to the data if they want to experiment with it
> on their own setups.

Epimorphics use i2.xlarge for data servers, including bulk loading of a
400e6-triple dataset.  They have a much larger SSD - 1x800GB versus your
2x80GB - which might be an issue if the SSD has to move data around
(SSDs don't like overwrites).

The shape of the data makes a big difference to performance: e.g. the
ratio of unique nodes to triples, and the literals - how long they are,
and what proportion are inline-able (some, but not all, kinds of
numbers, and date(time)s).

> Firstly I had a lot of issues just getting through the bulk load phase,
> on one system the bulk load phase ran for about a day before I killed
> it.  It had slowed to a crawl of about 500 triples per second with 60
> million triples to go and had been slowing down gradually since about
> the 280 million triple mark.  JVM on that system is configured to use 8G
> heap.
>
> Secondly it would be nice if we could have the phases of the bulk load
> be incremental steps.  In one case I successfully completed the data
> phase (building the node table) but then sort failed on me due to an
> error about insufficient temporary disk space.  This it turns out is
> because sort writes to /tmp by default which because I was running on
> EC2 was on the tiny startup disk.  In order to proceed with the bulk
> load I had to hack the scripts to skip the data phase and proceed with
> the existing files.
>
> So this raises several issues:
>
> 1 - Why are the temporary data files named with an extension of the
> parent process pid (from $$)?
>
> I assume this is to prevent process collisions but since tdbloader2
> explicitly refuses to run if the database directory is non-empty this
> seems superfluous.  When I hacked the scripts I had to hard code in the
> relevant file extensions; just using .tmp as an extension would avoid this

The unique names mean a load can be restarted.

It takes a small amount of hacking of the script currently: in
tdbloader2worker you can set KEEPWORKFILES, change TMP to the previous
run's number, then skip CmdNodeTableBuilder.

(Not ideal, but lack of time means polishing the rough edges off suffers
a bit; hopefully recovery isn't needed very often.)
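For reference, that recovery hack looks roughly like the following.  This
is a hedged recipe reconstructed from the description above, not a
supported feature; the exact edits to tdbloader2worker may differ by
version.

```shell
# Hedged recovery recipe: resume a tdbloader2 run after the data phase.
# 1. Keep the work files from the failed run:
export KEEPWORKFILES=1
# 2. Note the pid suffix on the surviving work files, e.g. *.12345,
#    and edit tdbloader2worker so TMP is that number instead of $$.
# 3. Skip the CmdNodeTableBuilder (data phase) step and rerun the script
#    so it picks up the existing node table and sorted files.
```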

> 2 - Could the two phases be separated into different scripts so advanced
> users could resume builds that fail after tweaking the SORT_ARGS?

It is possible in the design and needs better exposing - it's a number
of separate steps that the script turns into a workflow.  The data phase
and index phase can be split apart, or the choice of indexes built
altered.  But currently that means hacking the script.

>
> It seems like this wouldn't be too hard to do (hey I just volunteered to
> try) and it would make life easier in cases where things did go wrong.
>
> 3 - Could node table building be incremental?
>
> Rather than invoking the node table builder on all files could you do
> it one file at a time; this would hopefully give the JVM a chance to
> clean up after each file and possibly avoid getting stuck in the GC hell
> that seems to sometimes result.

A useful experiment to do.

Currently, the code isn't tested to do that (it might even "just work"
but I doubt it - may well be close though).

----
A longer term goal might be separate loads + a NodeTable merge step.

NodeIds are allocated incrementally so merging independently built node
tables would need to reallocate nodes.  The tuples need rewriting as well.

(thinking out loud) What might work (experiment needed) is build a bunch
of node tables, then run a merger - my sense is that this will only be
better at quite large scale.  It enables parallel node table building at
a cost of another pass.
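A toy illustration of why the merge step forces reallocation (the
id/node file format here is invented for the sketch; real node tables
are binary): two independently built tables assign ids starting from 0,
so they clash, and a merger must hand out fresh ids.

```shell
# Two independently built node tables, one "id node" pair per line.
cat > tableA.txt <<'EOF'
0 <http://ex/a>
1 <http://ex/b>
EOF
cat > tableB.txt <<'EOF'
0 <http://ex/b>
1 <http://ex/c>
EOF
# Merge: give each distinct node a fresh id.  A real merger would also
# record the per-table old->new id mapping, which is exactly why every
# tuple file referencing either table needs rewriting.
awk '!($2 in newid) { newid[$2] = next_id++ }
     END { for (n in newid) print newid[n], n }' tableA.txt tableB.txt | sort -n
# prints:
# 0 <http://ex/a>
# 1 <http://ex/b>
# 2 <http://ex/c>
```

Note that `<http://ex/b>` was id 1 in one table and id 0 in the other;
after the merge it has a single id, so tuples from both inputs must be
rewritten against the merged table.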

For TDB2 (i.e. this is a disk format change), switching to hash ids
would greatly enhance loading - a parallel loader, with multiple input
files, becomes possible.  The cost is longer ids.

TDB2 is already able to do some parallelism in loading - I experimented
with one thread doing the parse-to-NodeId step and separate threads
building the indexes.  But that is not parallel node table building;
that needs hash ids.

Hash ids need to be longer than the current ids.  Well, sort of: 8 bytes
is a bit close to the limits at around 1 billion nodes (IIRC from
4store).  10 bytes, with one bit lost for the type indicator, would be
safer.
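The "close to the limits" worry is just the birthday bound.  A rough
estimate (my arithmetic, not from the thread) for ~1 billion distinct
nodes:

```shell
# Birthday-bound collision estimate: p ~= n^2 / 2^(b+1) for n nodes
# hashed into b-bit ids (one bit of each id reserved for the type flag).
awk 'BEGIN {
  n = 1e9                          # ~1 billion distinct RDF nodes
  printf "63-bit ids (8 bytes, 1 type bit): p ~= %.3f\n",  n*n / (2 * 2^63)
  printf "79-bit ids (10 bytes, 1 type bit): p ~= %.1e\n", n*n / (2 * 2^79)
}'
# 63-bit ids: ~5% chance of at least one collision - uncomfortably close.
# 79-bit ids: under 1e-6 - comfortably safe.
```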

----

On GC hell - is all of RAM getting used for mmap files?  I suspect and
would like to pin down with f-a-c-t-s that not all RAM gets used for
mmap files in some setups.  This may be some OS/process environment
setting limiting it, or whether it is the OS file cache algorithms
choosing a bad path.

But if you are truly in Java GC hell, as indicated by profiling or GC
logging, then something else is wrong.  The tdbloader2 data phase does
not use much heap space (2-3G is enough) and more is worse as it
competes with mmap files.  But too small is bad for churn reasons - the
default in JVM_ARGS is IIRC 1.2G; setting that to 2G might help if it is
true JVM GC hell because the literals distribution is getting in the way.

Are you setting JVM_ARGS?  SORT_ARGS?

And experimenting with increasing the node cache would be good -
hardware has moved on since I last checked.  That assumes things about
the triple/unique node ratio.
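Concretely, both knobs are environment variables picked up by the
script, so an invocation might look like this (heap size, sort buffer,
and paths are illustrative suggestions, not the defaults):

```shell
# Illustrative tdbloader2 invocation: modest JVM heap for the data phase,
# GNU sort buffer and temp files pointed at the second instance SSD.
export JVM_ARGS="-Xmx2G"
export SORT_ARGS="--buffer-size=50% --temporary-directory=/mnt/ssd2/tmp"
tdbloader2 --loc /mnt/ssd1/DB data/*.nt
```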

> Of course if that command starts a new node table from scratch each time
> then this won't work and I guess this might screw over the statistics
> collection?

stats can be recalculated - don't worry about them (and fixed.opt does
an annoyingly decent job of BGP reordering in recent versions, for many
practical applications).

> 4 - Could we check the available space on the TMPDIR drive and issue a
> warning if we think the free space may be too small?
>
> Not sure how we would determine too small, presumably you'd find the
> available space on the drive on which TMPDIR resides (or the user
> specified directory in SORT_ARGS if configured) and then find the size
> of the input file and compare the two.  Depending on the comparison
> possibly issue a warning

Good idea.
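A sketch of such a check; the thresholds (error below 1x the input size,
warn below 2x) are guesses at what "too small" might mean, not a
specification:

```shell
# Pre-flight free-space check for the sort temp area.
# Usage: check_tmp_space TMPDIR INPUT_FILE
check_tmp_space() {
  tmpdir="$1"; input="$2"
  free_kb=$(df -Pk "$tmpdir" | awk 'NR==2 { print $4 }')   # available KB
  input_kb=$(du -k "$input" | awk '{ print $1 }')          # input size KB
  if [ "$free_kb" -lt "$input_kb" ]; then
    echo "ERROR: $tmpdir has ${free_kb}KB free, input is ${input_kb}KB" >&2
    return 1
  elif [ "$free_kb" -lt $((2 * input_kb)) ]; then
    echo "WARN: $tmpdir free space (${free_kb}KB) may be too small" >&2
  fi
}
```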

> If free space is detected as being really low we may also want to bail out
> completely with an error
>
> 5 - No progress reporting during sort
>
> There is no progress reporting during sort which means the only way to
> monitor progress is to look at top and check something is still running.
>
> When running in the foreground something like pv (where available)
> would provide useful progress information, especially since we know the
> size of the input file and the output file should be the same size
>
> I'll start playing with this, if people think this is worth exploring I'll
> file a proper JIRA for it

This is hard because the sort is an external one and that program does
not show progress.

Ideally: sort(1) sorts text, so the files to process are text.  If there
were an open source, efficient external sort program for binary
(fixed-width field) records that could be used in place of sort(1),
there might be improvements.  Any suggestions?
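For the foreground case, the pv suggestion above amounts to piping
sort's stdin through pv when it is installed (the intermediate file name
here is a stand-in, and SORT_ARGS is left to the environment):

```shell
# Demo: create a small stand-in for the intermediate triples work file.
DATA="data-triples.tmp"
printf 'b\na\nb\n' > "$DATA"
# Feed sort through pv (progress bar + ETA) when available, else plain sort.
if command -v pv >/dev/null 2>&1; then
  pv "$DATA" | sort $SORT_ARGS -u -o "$DATA.sorted"
else
  sort $SORT_ARGS -u -o "$DATA.sorted" "$DATA"
fi
```

Since pv knows the input file size it can show percentage and ETA, which
is exactly the missing progress reporting for the sort phase.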


------

Another improvement TDB2 makes is that nodes are written in binary.
Disk format change.

        Andy

>
> Rob
>
>
>
