Ok I've started playing with rewriting the scripts as I suggested and am a good way along. I've opened a JIRA:
https://issues.apache.org/jira/browse/JENA-977

And have pushed code to a corresponding branch. The code is a good way
along but slightly rough around the edges, because right now the scripts
are slightly hacked so I can use JENA_HOME pointed at a 2.13.0 release
build. Also, I know the error checking is not correct in at least one
place. I'll try and get this finished off next week and then put out a
PR for review

Rob

On 26/06/2015 14:29, "Andy Seaborne" <[email protected]> wrote:
>On 25/06/15 15:57, Rob Vesse wrote:
>> Folks
>>
>> So I've been using tdbloader2 in anger to try and get some exemplar
>> results for some queries to compare against another implementation for
>> correctness. However I have been running into a lot of trouble getting
>> tdbloader2 to run successfully, and as a result have some
>> questions/suggestions regarding how we might improve the loader.
>>
>> The source data is composed of about 68GB of NTriples spread across 21
>> files consisting of ~470 million triples, which is perhaps somewhat
>> high for using TDB. I tried to build this on an EC2 m3.2xlarge node
>> using the two instance SSD volumes, one for the database and one for
>> sort temp files. I can point people to the data if they want to
>> experiment with it on their own setups.
>
>Epimorphics use i2.xlarge for data servers, including bulk loading of a
>400e6 dataset. They have a much larger SSD. That might be an issue:
>2x80GB vs 1x800GB, if the SSD has to move data around (they don't like
>overwrites).
>
>The shape of the data makes a big difference to performance, e.g. the
>unique node to triple ratio, and literals. For literals: long literals,
>and also the proportion of inline-ables (some, not all, kinds of
>numbers and date(times)).
>
>> Firstly I had a lot of issues just getting through the bulk load
>> phase; on one system the bulk load phase ran for about a day before I
>> killed it.
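The unique node to triple ratio Andy mentions can be estimated up front with standard tools. A rough sketch follows - the N-Triples handling is naive (one triple per line, simple whitespace splitting), so treat the numbers only as an estimate; a real parser such as riot would be more robust:

```shell
#!/bin/sh
# Rough gauge of the unique-node/triple ratio: split each N-Triples line
# into subject, predicate and object, then count distinct terms.
# Naive parsing - unusual literals may confuse it.
node_triple_stats() {
    nt_file="$1"
    triples=$(wc -l < "$nt_file" | tr -d ' ')
    nodes=$(awk '{ s=$1; p=$2; sub(/^[^ ]+ [^ ]+ /, ""); sub(/ \.$/, "");
                   print s; print p; print $0 }' "$nt_file" \
            | sort -u | wc -l | tr -d ' ')
    echo "triples=$triples unique_terms=$nodes"
}
```

A high unique-terms count relative to the triple count means more node table work per triple loaded.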
>> It had slowed to a crawl of about 500 triples per second with 60
>> million triples to go, and had been slowing down gradually since about
>> the 280 million triple mark. The JVM on that system is configured to
>> use an 8G heap.
>>
>> Secondly, it would be nice if we could have the phases of the bulk
>> load be incremental steps. In one case I successfully completed the
>> data phase (building the node table) but then sort failed on me due to
>> an error about insufficient temporary disk space. This, it turns out,
>> is because sort writes to /tmp by default, which, because I was
>> running on EC2, was on the tiny startup disk. In order to proceed with
>> the bulk load I had to hack the scripts to skip the data phase and
>> proceed with the existing files.
>>
>> So this raises several issues:
>>
>> 1 - Why are the temporary data files named with an extension of the
>> parent process pid (from $$)?
>>
>> I assume this is to prevent process collisions, but since tdbloader2
>> explicitly refuses to run if the database directory is non-empty this
>> seems superfluous. When I hacked the scripts I had to hard code in the
>> relevant file extensions; just using .tmp as an extension would avoid
>> this.
>
>The unique names mean a load can be restarted. It takes a small amount
>of hacking the script currently, but in tdbloader2worker you can set
>KEEPWORKFILES and change TMP to the previous number, then skip
>CmdNodeTableBuilder. (Not ideal, but lack of time means polishing the
>rough edges off suffers a bit; hopefully recovery isn't needed very
>often.)
>
>> 2 - Could the two phases be separated into different scripts so
>> advanced users could resume builds that fail after tweaking the
>> SORT_ARGS?
>
>It is possible in the design and needs better exposing - it's a number
>of separate steps that the script turns into a workflow. The data phase
>and index phase can be split apart, or the choice of indexes built
>altered. But it's currently script hacking.
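A split workflow along the lines of suggestion 2 might look something like the following. All script names and options here are hypothetical - this is a sketch of the idea being volunteered for, not an existing interface:

```shell
# Hypothetical split of tdbloader2 into separately runnable phases, so a
# failed sort can be retried without redoing the expensive data phase.
tdbloader2data --loc /mnt/ssd1/db /data/part-*.nt

# Sort ran out of temp space? Point it at a bigger disk and rerun just
# the index phase against the existing work files:
SORT_ARGS="-T /mnt/ssd2/tmp" tdbloader2index --loc /mnt/ssd1/db
```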
>> It seems like this wouldn't be too hard to do (hey, I just volunteered
>> to try) and it would make life easier in cases where things do go
>> wrong.
>>
>> 3 - Could node table building be incremental?
>>
>> Rather than invoking the node table builder on all files, could you do
>> it one file at a time? This would hopefully give the JVM a chance to
>> clean up after each file and possibly avoid getting stuck in the GC
>> hell that seems to sometimes result.
>
>A useful experiment to do. Currently, the code isn't tested to do that
>(it might even "just work" but I doubt it - may well be close though).
>
>----
>A longer term goal might be separate loads + a NodeTable merge step.
>
>NodeIds are allocated incrementally, so merging independently built
>node tables would need to reallocate nodes. The tuples need rewriting
>as well.
>
>(Thinking out loud) What might work (experiment needed) is to build a
>bunch of node tables, then run a merger - my sense is that this will
>only be better at quite large scale. It enables parallel node table
>building at the cost of another pass.
>
>For TDB2 (i.e. this is a disk format change), switching to hash ids
>would greatly enhance loading - a parallel loader, with multiple input
>files, is possible, at a cost.
>
>TDB2 is also able to do some parallelism in loading - I experimented
>with one thread doing the parse-nodeid step and threads doing the
>separate index building. So it's not parallel node table building -
>that needs hash ids.
>
>Hash ids need to be longer than current ids (well, sort of; 8 bytes is
>a bit close to the limits - 1 billion nodes IIRC from 4store). 10 bytes
>with one bit lost for the type indicator would be safer.
>
>----
>
>On GC hell - is all of RAM getting used for mmap files? I suspect, and
>would like to pin down with f-a-c-t-s, that not all RAM gets used for
>mmap files in some setups. This may be some OS/process environment
>setting limiting it, or the OS file cache algorithms choosing a bad
>path.
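One way to get those f-a-c-t-s is GC logging, enabled through the JVM_ARGS variable the scripts already honour. A sketch (flags are for the Java 7/8-era HotSpot JVMs current at the time; paths illustrative):

```shell
# Log GC activity so true JVM GC pressure can be told apart from OS
# page-cache effects. Heap deliberately kept small (2-3G is enough for
# the data phase, per the advice below).
export JVM_ARGS="-Xmx2G -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/tmp/tdbloader2-gc.log"
tdbloader2 --loc /mnt/ssd1/db /data/part-*.nt
```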
>But if you are truly in Java GC hell, as indicated by profiling or GC
>logging, then something else is wrong. The tdbloader2 data phase does
>not use much heap space (2-3G is enough) and more is worse, as it
>competes with mmap files. But too small is bad for churn reasons - the
>default JVM_ARGS is IIRC 1.2G - setting that to 2G might help if it is
>true JVM GC hell, if the literals distribution is getting in the way.
>
>Are you setting JVM_ARGS? SORT_ARGS?
>
>And experimenting with increasing the node cache would be good -
>hardware has moved on since I last checked. That assumes things about
>the triple/unique node ratio.
>
>> Of course if that command starts a new node table from scratch each
>> time then this won't work, and I guess this might screw over the
>> statistics collection?
>
>Stats can be recalculated - don't worry about them (and fixed.opt does
>an annoyingly decent job of BGP reordering in recent versions for many
>practical applications).
>
>> 4 - Could we check the available space on the TMPDIR drive and issue a
>> warning if we think the free space may be too small?
>>
>> Not sure how we would determine "too small"; presumably you'd find the
>> available space on the drive on which TMPDIR resides (or the user
>> specified directory in SORT_ARGS if configured), then find the size of
>> the input file and compare the two. Depending on the comparison,
>> possibly issue a warning.
>
>Good idea.
>
>> If free space is detected as being really low we may also want to bail
>> out completely with an error.
>>
>> 5 - No progress reporting during sort
>>
>> There is no progress reporting during sort, which means the only way
>> to monitor progress is to look at top and check something is still
>> running.
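A minimal sketch of the check proposed in point 4, with the caveat that "free space at least equal to the input size" is only a guessed threshold:

```shell
#!/bin/sh
# Warn when the sort temp area has less free space than the input data.
check_tmp_space() {
    tmp_dir="$1"      # where sort will write its temp files
    input_bytes="$2"  # total size of the input data, in bytes
    # POSIX df -P: available 1K blocks are column 4 of the data row.
    avail_kb=$(df -Pk "$tmp_dir" | awk 'NR==2 { print $4 }')
    if [ $((avail_kb * 1024)) -lt "$input_bytes" ]; then
        echo "WARN: $tmp_dir has ${avail_kb}K free; input is $input_bytes bytes" >&2
        return 1
    fi
    return 0
}
```

A caller could escalate the warning to a hard error when free space is only a small fraction of the input, e.g. `check_tmp_space "${TMPDIR:-/tmp}" "$(wc -c < data.nt)" || exit 1`.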
>> When running in the foreground, something like pv (where available)
>> would provide useful progress information, especially since we know
>> the size of the input file and the output file should be the same
>> size.
>>
>> I'll start playing with this; if people think this is worth exploring
>> I'll file a proper JIRA for it.
>
>This is hard because the sort is an external one and that program does
>not show progress.
>
>Ideal: sort(1) sorts text, so the files to process are text. If there
>were an open source, efficient, external sort program for binary
>(fixed-width fields) data used in place of sort(1) there might be
>improvements. Any suggestions?
>
>------
>
>Another improvement TDB2 makes is that nodes are written in binary.
>Disk format change.
>
> Andy
>
>> Rob
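The pv idea from point 5 could be wired in roughly as follows: pv copies its input to stdout unchanged while drawing a progress bar on stderr, and plain sort is used when pv is not installed (function name and variables are illustrative):

```shell
#!/bin/sh
# Run the external sort with a progress bar when pv is available.
# SORT_ARGS (e.g. "-T /mnt/ssd2/tmp") is passed through as-is.
sort_with_progress() {
    in_file="$1"; out_file="$2"
    if command -v pv >/dev/null 2>&1; then
        pv "$in_file" | sort $SORT_ARGS > "$out_file"
    else
        sort $SORT_ARGS "$in_file" > "$out_file"
    fi
}
```

Since sort's output is the same size as its input, pv's size-based estimate of time remaining is meaningful here.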
