Thank you, Øyvind, for sharing; it is great to see more tests in the wild.

I ran the test on a 1TB SSD / RAID1 / 64GB / Ubuntu machine with the truthy dataset and quickly ran out of disk space. The job finished but did not write any of the indexes to disk due to lack of space, and there were no error messages.
http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test, possibly with the full Wikidata dataset.

I have personally had the best experience so far with dedicated hardware (it can be in the data center); shared or dedicated virtual compute engines did not deliver as expected. I have also not seen great benefits from data-center-grade multicore CPUs, but I think they will pay off at runtime in multi-user settings (e.g. Fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com> wrote:

> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
> dump downloaded 2021-12-09.
>
> The instance is a vm created on the Norwegian Research and Education Cloud,
> an openstack cloud provider.
>
> Instance type:
> 32 GB memory
> 4 CPU
>
> The storage used for dump + temp files is mounted as a separate 900GB
> volume and is mounted on /var/fuseki/databases.
> The type of storage is described as
> > *mass-storage-default*: Storage backed by spinning hard drives,
> available to everybody and is the default type.
> with ext4 configured. At the moment I don't have access to the faster
> volume type mass-storage-ssd. CPU and memory are not dedicated, and can be
> overcommitted.
>
> OS for the instance is a clean Rocky Linux image, with no services except
> jena/fuseki installed. The systemd service set up for fuseki is stopped.
> jena and fuseki version is 4.3.0.
>
> openjdk 11.0.13 2021-10-19 LTS
> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
>
> I'm running from a tmux session to avoid connectivity issues and to capture
> the output. I think the output is stored in memory and not on disk.
> On First run I tried to have the tmpdir on the root partition, to separate
> temp dir and data dir, but with only 19 GB free, the tmpdir soon was disk
> full. For the second (current run) all directories are under
> /var/fuseki/databases.
>
> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
> /var/fuseki/databases/tmp latest-truthy.nt.gz
>
> The import is so far at the "ingest data" stage where it has really slowed
> down.
>
> Current output is:
>
> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 /
> Avg: 7,593)
>
> See full log so far at
> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
>
> Some notes:
>
> * There is a (time/info) lapse in the output log between the end of
> 'parse' and the start of 'index' for Terms. It is unclear to me what is
> happening in the 1h13 minutes between the lines.
>
> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10
> 22:33:46 CET]
> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds :
> 6,560,468,631 triples/quads 129,331 TPS
> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 /
> Avg: 237,755)
>
> * The ingest data step really slows down on the "ingest data stage": At the
> current rate, if I calculated correctly, it looks like PKG.CmdxIngestData
> has 10 days left before it finishes.
>
> * When I saw sort running in the background for the first parts of the job,
> I looked at the `sort` command. I noticed from some online sources that
> setting the environment variable LC_ALL=C improves speed for `sort`. Could
> this be set on the ProcessBuilder for the `sort` process? Could it
> break/change something? I see the warning from the man page for `sort`.
>
> *** WARNING *** The locale specified by the environment affects
> sort order. Set LC_ALL=C to get the traditional sort order that
> uses native byte values.
>
> Links:
> https://access.redhat.com/solutions/445233
>
> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
>
> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
>
> Best regards,
> Øyvind
>

--

---
Marco Neumann
KONA
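
On the remaining-time estimate in the quoted notes, a rough check can be sketched as below. It assumes the "ingest data" stage has to add the same 6,560,468,631 triples that the parse phase counted, and that the rate stays constant; both are assumptions, and the log suggests the rate is still falling. The helper name remaining_days is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def remaining_days(total: int, done: int, rate_per_sec: float) -> float:
    """Days needed to add (total - done) items at a constant rate."""
    return (total - done) / rate_per_sec / SECONDS_PER_DAY


# Figures taken from the quoted log lines.
TOTAL = 6_560_468_631  # triples/quads reported at the end of the parse phase
DONE = 502_000_000     # "Add: 502,000,000 Data" in the current output

# Assumption: the rate stays constant from here on.
eta_avg = remaining_days(TOTAL, DONE, 7_593)    # running average rate
eta_batch = remaining_days(TOTAL, DONE, 3_356)  # most recent batch rate

print(f"at average rate:      {eta_avg:.1f} days")
print(f"at latest batch rate: {eta_batch:.1f} days")
```

With the logged figures this gives roughly 9 days at the running average and roughly 21 days at the latest batch rate, so the 10-day figure is consistent with the average rate and may be optimistic if the slowdown continues.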
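
On the LC_ALL=C question in the quoted notes, here is a minimal sketch, in Python rather than Java, of setting the variable for a child `sort` process only. It is a stand-in for what adding the variable to a ProcessBuilder environment would do, not the xloader code. The helper sort_lines_c_locale is hypothetical, and the sketch assumes a POSIX `sort` on the PATH.

```python
import os
import subprocess


def sort_lines_c_locale(lines):
    """Sort text lines with the external `sort`, forcing byte-value order."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["LC_ALL"] = "C"     # takes precedence over LANG and other LC_* values
    result = subprocess.run(
        ["sort"],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        env=env,
        check=True,  # raise if `sort` exits with a non-zero status
    )
    return result.stdout.splitlines()


sample = ["b", "A", "a", "B"]
sorted_c = sort_lines_c_locale(sample)
print(sorted_c)
```

Under LC_ALL=C the order follows byte values, so uppercase ASCII sorts before lowercase, which can differ from the order produced under a locale such as en_US.UTF-8. Whether any later loader step depends on a particular order is not something this sketch settles.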