Thank you, Lorenz. Can you please post a directory listing for Data-0001 with file sizes?
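Something like this would capture it (a sketch, assuming the default TDB2 layout where the Data-0001 generation sits directly under the database directory):

    # human-readable sizes for every file in the database's Data-0001 directory
    ls -lh datasets/wikidata-tdb/Data-0001

    # or, sorted by size
    du -ah datasets/wikidata-tdb/Data-0001 | sort -h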
On Thu, Dec 16, 2021 at 8:49 AM LB <conpcompl...@googlemail.com.invalid> wrote:

> Loading of the latest WD truthy dump (6.6 billion triples), Bzip2 compressed:
>
> Server:
>
>   AMD Ryzen 9 5950X (16C/32T)
>   128 GB DDR4 ECC RAM
>   2 x 3.84 TB NVMe SSD
>
> Environment:
>
>   - Ubuntu 20.04.3 LTS
>   - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>   - Jena 4.3.1
>
> Command:
>
>   tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb \
>       --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
> Log summary:
>
>   04:14:28 INFO  Load node table  = 36600 seconds
>   04:14:28 INFO  Load ingest data = 25811 seconds
>   04:14:28 INFO  Build index SPO  = 20688 seconds
>   04:14:28 INFO  Build index POS  = 35466 seconds
>   04:14:28 INFO  Build index OSP  = 25042 seconds
>   04:14:28 INFO  Overall 143607 seconds
>   04:14:28 INFO  Overall 39h 53m 27s
>   04:14:28 INFO  Triples loaded = 6.610.055.778
>   04:14:28 INFO  Quads loaded = 0
>   04:14:28 INFO  Overall Rate 46.028 tuples per second
>
> Disk space usage according to
>
>   du -sh datasets/wikidata-tdb
>
> is
>
>   524G datasets/wikidata-tdb
>
> During loading I could see ~90 GB of RAM occupied (50% of total memory
> goes to sort, and it used 2 threads - is it intended to stick to 2
> threads with --parallel 2?)
>
> Cheers,
> Lorenz
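(On the --parallel question: per Andy's explanation further down, the external sort is run with a 50% memory buffer and two threads. A rough hand-run equivalent, as a sketch only - the exact invocation lives in the xloader scripts, and the file names here are hypothetical:)

    # Approximate the sort settings the loader uses.
    # LC_ALL=C forces plain byte-value ordering -- see the locale discussion below.
    export LC_ALL=C
    sort --buffer-size=50% --parallel=2 -T /data/tmp/tdb -o sorted.tmp triples.tmp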
> On 12.12.21 13:07, Andy Seaborne wrote:
> > Hi, Øyvind,
> >
> > This is all very helpful feedback. Thank you.
> >
> > On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy dump downloaded 2021-12-09.
> >
> > This is the 4.3.0 xloader?
> >
> > There are improvements in 4.3.1: since that release was going out,
> > the development version, which uses less temporary space, got merged
> > in. It has had some testing.
> >
> > It compresses triples.tmp and the intermediate sort files in the
> > index stage, making the peak usage much smaller.
> >
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud, an openstack cloud provider.
> >>
> >> Instance type:
> >>   32 GB memory
> >>   4 CPU
> >
> > I'm using something similar on a 7-year-old desktop machine with a
> > SATA disk.
> >
> > I haven't got a machine I can dedicate to the multi-day load. I'll
> > try to find a way to at least push it through building the node
> > table.
> >
> > Loading the first 1B of truthy:
> >
> >   1B triples, 40k TPS, 06h 54m 10s
> >
> > The database is 81G, and building needs an additional 11.6G of
> > workspace, for a total of 92G (+ the data file).
> >
> > While smaller, bz2 files seem much slower to decompress, so I've
> > been using gz files.
> >
> > My current best guess for 6.4B truthy is
> >
> >   Temp       96G
> >   Database  540G
> >   Data       48G
> >   Total:    684G -- peak disk needed
> >
> > based on scaling up 1B truthy. Personally, I would make sure there
> > was more space. Also - I don't know if the shape of the data is
> > sufficiently uniform to make scaling predictable. The time doesn't
> > scale so simply.
> >
> > This is for the 4.3.1 version - 4.3.0 uses a lot more disk space.
> >
> > Compression reduces the size of triples.tmp -- and of the related
> > sort temporary files, which add up to the same again -- to 1/6.
> >
> >> The storage used for dump + temp files is mounted as a separate
> >> 900GB volume on /var/fuseki/databases. The type of storage is
> >> described as
> >>
> >>   *mass-storage-default*: Storage backed by spinning hard drives,
> >>   available to everybody and is the default type.
> >>
> >> with ext4 configured. At the moment I don't have access to the
> >> faster volume type mass-storage-ssd. CPU and memory are not
> >> dedicated, and can be overcommitted.
> >
> > "overcommitted" may be a problem.
> >
> > While it's not "tdb2 loader parallel", it does use continuous CPU in
> > several threads.
> >
> > For memory - "it's complicated".
> >
> > The java parts only need, say, 2G. The sort is set to "buffer 50%,
> > --parallel=2", and the java process pipes into sort - that's another
> > thread. I think the effective peak is 3 active threads, and they'll
> > all be at 100% for some of the time.
> >
> > So it's going to need 50% of RAM, + 2G for a java process, + OS.
> >
> > It does not need space for memory-mapped files (they aren't used at
> > all in the loading process, and I/O is sequential).
> >
> > If that triggers overcommitment swap-out, the performance may go
> > down a lot.
> >
> > For disk - if that is physically remote, it should not be a problem
> > (famous last words). I/O is sequential and in large continuous
> > chunks - typical for batch processing jobs.
> >
> >> OS for the instance is a clean Rocky Linux image, with no services
> >> except jena/fuseki installed. The systemd service set up for fuseki
> >> is stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >>   openjdk 11.0.13 2021-10-19 LTS
> >>   OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >>   OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >
> > Just FYI: Java 17 is a little faster. Some java improvements have
> > raised RDF parsing speed by up to 10%; in xloader that's not
> > significant to the overall time.
> >
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture the output.
> >
> > I use
> >
> >   tdb2.xloader .... |& tee LOG-FILE-NAME
> >
> > to capture the logs and see them. ">&" and "tail -f" would achieve
> > much the same effect.
> >
> >> I think the output is stored in memory and not on disk.
> >> On the first run I tried to have the tmpdir on the root partition,
> >> to separate the temp dir and the data dir, but with only 19 GB
> >> free, the tmpdir soon filled the disk. For the second (current) run
> >> all directories are under /var/fuseki/databases.
> >
> > Yes - after making that mistake myself, the new version ignores the
> > system TMPDIR. Using --tmpdir is best, but otherwise it defaults to
> > the data directory.
> >
> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
> >>       --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage, where it has
> >> really slowed down.
> >
> > FYI: The first line of ingest is always very slow. It is not
> > measuring the start point correctly.
> >
> >> Current output is:
> >>
> >>   20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
> >>
> >> See the full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >
> > The earlier first pass also slows down; it should be a fairly
> > constant-ish speed step once everything settles down.
> >
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms. It is unclear to me
> >> what is happening in the 1h 13m between the lines:
> >>
> >>   22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >>   22:33:52 INFO Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
> >>   23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >
> > There is "sort" going on. "top" should show it.
> >
> > For each index there is also a very long pause, for exactly the same
> > reason. It would be good to have something go "tick" and log a
> > message occasionally.
> >
> >> * The ingest data step really slows down: at the current rate, if I
> >> calculated correctly, it looks like PKG.CmdxIngestData has 10 days
> >> left before it finishes.
> >
> > Ouch.
> >
> >> * When I saw sort running in the background for the first parts of
> >> the job, I looked at the `sort` command. I noticed from some online
> >> sources that setting the environment variable LC_ALL=C improves
> >> speed for `sort`. Could this be set on the ProcessBuilder for the
> >> `sort` process? Could it break/change something? I see the warning
> >> from the man page for `sort`:
> >>
> >>   *** WARNING *** The locale specified by the environment affects
> >>   sort order. Set LC_ALL=C to get the traditional sort order that
> >>   uses native byte values.
> >
> > It shouldn't matter but, yes, better to set it and export it in the
> > control script so it propagates to the forked processes.
> >
> > The sort is a binary sort except that, because sort is a text
> > program, the binary is turned into hex (!!). Hex is in the ASCII
> > subset and should be locale safe.
> >
> > But better to set LC_ALL=C.
> >
> >     Andy
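(The locale effect on sort order is easy to see with a two-line experiment:)

    # locale collation (e.g. en_US.UTF-8) case-folds, so "a" sorts before "B"
    printf 'B\na\n' | sort
    # LC_ALL=C compares native byte values, so "B" (0x42) sorts before "a" (0x61)
    printf 'B\na\n' | LC_ALL=C sort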
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind

--
---
Marco Neumann
KONA