Ok I've started playing with rewriting the scripts as I suggested and am a good way along. I've opened a JIRA:
https://issues.apache.org/jira/browse/JENA-977

And have pushed code to a corresponding branch. The code is a good way
along but slightly rough around the edges, because right now the scripts
are slightly hacked so I can use JENA_HOME pointed at a 2.13.0 release
build. Also, I know the error checking is not correct in at least one
place. I'll try and get this finished off next week and then put out a
PR for review

Rob

On 26/06/2015 14:29, "Andy Seaborne" <[email protected]> wrote:
>On 25/06/15 15:57, Rob Vesse wrote:
>> Folks
>>
>> So I've been using tdbloader2 in anger to try and get some exemplar
>> results for some queries to compare against another implementation for
>> correctness. However I have been running into a lot of trouble getting
>> tdbloader2 to run successfully, and as a result have some
>> questions/suggestions regarding how we might improve the loader.
>>
>> The source data is composed of about 68GB of NTriples spread across 21
>> files consisting of ~470 million triples, which is perhaps somewhat
>> high for using TDB. I tried to build this on an EC2 m3.2xlarge node
>> using the two instance SSD volumes, one for the database and one for
>> sort temp files. I can point people to the data if they want to
>> experiment with it on their own setups.
>
>Epimorphics use i2.xlarge for data servers, including bulk loading of a
>400e6 dataset. They have a much larger SSD. That might be an issue:
>2x80GB vs 1x800GB, if the SSD has to move data around (they don't like
>overwrites).
>
>The shape of the data makes a big difference to performance, e.g. the
>unique node to triple ratio, and literals. For literals: long literals,
>and also the proportion of inline-ables (some, not all, kinds of
>numbers and date(times)).
>
>> Firstly I had a lot of issues just getting through the bulk load
>> phase; on one system the bulk load phase ran for about a day before I
>> killed it.
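The unique node to triple ratio Andy mentions can be estimated up front with standard tools. A rough sketch follows - the N-Triples handling is naive (one triple per line, simple whitespace splitting), so treat the numbers only as an estimate; a real parser such as riot would be more robust:

```shell
#!/bin/sh
# Rough gauge of the unique-node/triple ratio: split each N-Triples line
# into subject, predicate and object, then count distinct terms.
# Naive parsing - unusual literals may confuse it.
node_triple_stats() {
    nt_file="$1"
    triples=$(wc -l < "$nt_file" | tr -d ' ')
    nodes=$(awk '{ s=$1; p=$2; sub(/^[^ ]+ [^ ]+ /, ""); sub(/ \.$/, "");
                   print s; print p; print $0 }' "$nt_file" \
            | sort -u | wc -l | tr -d ' ')
    echo "triples=$triples unique_terms=$nodes"
}
```

A high unique-terms count relative to the triple count means more node table work per triple loaded.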
>> It had slowed to a crawl of about 500 triples per second with 60
>> million triples to go, and had been slowing down gradually since about
>> the 280 million triple mark. The JVM on that system is configured to
>> use an 8G heap.
>>
>> Secondly, it would be nice if we could have the phases of the bulk
>> load be incremental steps. In one case I successfully completed the
>> data phase (building the node table) but then sort failed on me due to
>> an error about insufficient temporary disk space. This, it turns out,
>> is because sort writes to /tmp by default, which, because I was
>> running on EC2, was on the tiny startup disk. In order to proceed with
>> the bulk load I had to hack the scripts to skip the data phase and
>> proceed with the existing files.
>>
>> So this raises several issues:
>>
>> 1 - Why are the temporary data files named with an extension of the
>> parent process pid (from $$)?
>>
>> I assume this is to prevent process collisions, but since tdbloader2
>> explicitly refuses to run if the database directory is non-empty this
>> seems superfluous. When I hacked the scripts I had to hard code in the
>> relevant file extensions; just using .tmp as an extension would avoid
>> this.
>
>The unique names mean a load can be restarted. It takes a small amount
>of hacking the script currently, but in tdbloader2worker you can set
>KEEPWORKFILES and change TMP to the previous number, then skip
>CmdNodeTableBuilder. (Not ideal, but lack of time means polishing the
>rough edges off suffers a bit; hopefully recovery isn't needed very
>often.)
>
>> 2 - Could the two phases be separated into different scripts so
>> advanced users could resume builds that fail after tweaking the
>> SORT_ARGS?
>
>It is possible in the design and needs better exposing - it's a number
>of separate steps that the script turns into a workflow. The data phase
>and index phase can be split apart, or the choice of indexes built
>altered. But it's currently script hacking.
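A split workflow along the lines of suggestion 2 might look something like the following. All script names and options here are hypothetical - this is a sketch of the idea being volunteered for, not an existing interface:

```shell
# Hypothetical split of tdbloader2 into separately runnable phases, so a
# failed sort can be retried without redoing the expensive data phase.
tdbloader2data --loc /mnt/ssd1/db /data/part-*.nt

# Sort ran out of temp space? Point it at a bigger disk and rerun just
# the index phase against the existing work files:
SORT_ARGS="-T /mnt/ssd2/tmp" tdbloader2index --loc /mnt/ssd1/db
```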
>> It seems like this wouldn't be too hard to do (hey, I just volunteered
>> to try) and it would make life easier in cases where things do go
>> wrong.
>>
>> 3 - Could node table building be incremental?
>>
>> Rather than invoking the node table builder on all files, could you do
>> it one file at a time? This would hopefully give the JVM a chance to
>> clean up after each file and possibly avoid getting stuck in the GC
>> hell that seems to sometimes result.
>
>A useful experiment to do. Currently, the code isn't tested to do that
>(it might even "just work" but I doubt it - may well be close though).
>
>----
>A longer term goal might be separate loads + a NodeTable merge step.
>
>NodeIds are allocated incrementally, so merging independently built
>node tables would need to reallocate nodes. The tuples need rewriting
>as well.
>
>(Thinking out loud) What might work (experiment needed) is to build a
>bunch of node tables, then run a merger - my sense is that this will
>only be better at quite large scale. It enables parallel node table
>building at the cost of another pass.
>
>For TDB2 (i.e. this is a disk format change), switching to hash ids
>would greatly enhance loading - a parallel loader, with multiple input
>files, is possible, at a cost.
>
>TDB2 is also able to do some parallelism in loading - I experimented
>with one thread doing the parse-nodeid step and threads doing the
>separate index building. So it's not parallel node table building -
>that needs hash ids.
>
>Hash ids need to be longer than current ids (well, sort of; 8 bytes is
>a bit close to the limits - 1 billion nodes IIRC from 4store). 10 bytes
>with one bit lost for the type indicator would be safer.
>
>----
>
>On GC hell - is all of RAM getting used for mmap files? I suspect, and
>would like to pin down with f-a-c-t-s, that not all RAM gets used for
>mmap files in some setups. This may be some OS/process environment
>setting limiting it, or the OS file cache algorithms choosing a bad
>path.
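One way to get those f-a-c-t-s is GC logging, enabled through the JVM_ARGS variable the scripts already honour. A sketch (flags are for the Java 7/8-era HotSpot JVMs current at the time; paths illustrative):

```shell
# Log GC activity so true JVM GC pressure can be told apart from OS
# page-cache effects. Heap deliberately kept small (2-3G is enough for
# the data phase, per the advice below).
export JVM_ARGS="-Xmx2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/tmp/tdbloader2-gc.log"
tdbloader2 --loc /mnt/ssd1/db /data/part-*.nt
```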
>But if you are truly in Java GC hell, as indicated by profiling or GC
>logging, then something else is wrong. The tdbloader2 data phase does
>not use much heap space (2-3G is enough) and more is worse, as it
>competes with mmap files. But too small is bad for churn reasons - the
>default JVM_ARGS is IIRC 1.2G - setting that to 2G might help if it is
>true JVM GC hell, if the literals distribution is getting in the way.
>
>Are you setting JVM_ARGS? SORT_ARGS?
>
>And experimenting with increasing the node cache would be good -
>hardware has moved on since I last checked. That assumes things about
>the triple/unique node ratio.
>
>> Of course if that command starts a new node table from scratch each
>> time then this won't work, and I guess this might screw over the
>> statistics collection?
>
>Stats can be recalculated - don't worry about them (and fixed.opt does
>an annoyingly decent job of BGP reordering in recent versions for many
>practical applications).
>
>> 4 - Could we check the available space on the TMPDIR drive and issue a
>> warning if we think the free space may be too small?
>>
>> Not sure how we would determine "too small"; presumably you'd find the
>> available space on the drive on which TMPDIR resides (or the user
>> specified directory in SORT_ARGS if configured), then find the size of
>> the input file and compare the two. Depending on the comparison,
>> possibly issue a warning.
>
>Good idea.
>
>> If free space is detected as being really low we may also want to bail
>> out completely with an error.
>>
>> 5 - No progress reporting during sort
>>
>> There is no progress reporting during sort, which means the only way
>> to monitor progress is to look at top and check something is still
>> running.
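A minimal sketch of the check proposed in point 4, with the caveat that "free space at least equal to the input size" is only a guessed threshold:

```shell
#!/bin/sh
# Warn when the sort temp area has less free space than the input data.
check_tmp_space() {
    tmp_dir="$1"      # where sort will write its temp files
    input_bytes="$2"  # total size of the input data, in bytes
    # POSIX df -P: available 1K blocks are column 4 of the data row.
    avail_kb=$(df -Pk "$tmp_dir" | awk 'NR==2 { print $4 }')
    if [ $((avail_kb * 1024)) -lt "$input_bytes" ]; then
        echo "WARN: $tmp_dir has ${avail_kb}K free; input is $input_bytes bytes" >&2
        return 1
    fi
    return 0
}
```

A caller could escalate the warning to a hard error when free space is only a small fraction of the input, e.g. `check_tmp_space "${TMPDIR:-/tmp}" "$(wc -c < data.nt)" || exit 1`.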
>> When running in the foreground, something like pv (where available)
>> would provide useful progress information, especially since we know
>> the size of the input file and the output file should be the same
>> size.
>>
>> I'll start playing with this; if people think this is worth exploring
>> I'll file a proper JIRA for it.
>
>This is hard because the sort is an external one and that program does
>not show progress.
>
>Ideal: sort(1) sorts text, so the files to process are text. If there
>were an open source, efficient, external sort program for binary
>(fixed-width fields) data used in place of sort(1) there might be
>improvements. Any suggestions?
>
>------
>
>Another improvement TDB2 makes is that nodes are written in binary.
>Disk format change.
>
> Andy
>
>> Rob
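The pv idea from point 5 could be wired in roughly as follows: pv copies its input to stdout unchanged while drawing a progress bar on stderr, and plain sort is used when pv is not installed (function name and variables are illustrative):

```shell
#!/bin/sh
# Run the external sort with a progress bar when pv is available.
# SORT_ARGS (e.g. "-T /mnt/ssd2/tmp") is passed through as-is.
sort_with_progress() {
    in_file="$1"; out_file="$2"
    if command -v pv >/dev/null 2>&1; then
        pv "$in_file" | sort $SORT_ARGS > "$out_file"
    else
        sort $SORT_ARGS "$in_file" > "$out_file"
    fi
}
```

Since sort's output is the same size as its input, pv's size-based estimate of time remaining is meaningful here.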
