The more tests we have on different machines the better. :) Personally I'd say if you have a choice go for a PCIe 4.0 NVMe SSD and stay away from SATA SSDs older than SATA III. Also, SSD RAID isn't necessary for the tests.
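(As a quick aside: on Linux the kernel reports whether a block device is backed by spinning or flash media, which is handy when a cloud provider's volume types are vague. A small sketch; the device name is a hypothetical example, and the `sysfs` parameter exists only to make the helper testable:)

```python
from pathlib import Path

def is_rotational(device: str, sysfs: str = "/sys/block") -> bool:
    """True if the kernel reports the block device as spinning media (HDD)."""
    # The kernel exposes "1" for rotating disks and "0" for SSD/NVMe.
    return Path(sysfs, device, "queue", "rotational").read_text().strip() == "1"

# Example (hypothetical device name; adjust to whatever backs your volume):
# is_rotational("sda")
```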
These components have become extremely affordable in recent years and really should be part of a fast pipeline imo, in particular for tdb2.tdbloader in parallel mode. But as Andy has emphasized, he designed the tdb2.xloader process to be spinning-disk friendly, so SSDs are not a prerequisite for xloader.

On Tue, Dec 14, 2021 at 10:38 AM Øyvind Gjesdal <oyvin...@gmail.com> wrote:

> Hi Marco,
>
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary for a real world large import to complete? I'll
> request some ssd storage as well, and hope there's a quota for me :)
>
> Maybe I could also test different distros, to see if some of the default
> OS settings affect the import.
>
> Best regards,
> Øyvind
>
> On Sun, Dec 12, 2021 at 10:21 Marco Neumann <marco.neum...@gmail.com> wrote:
>
> > Øyvind, looks like the above was the wrong log from a prior sharding
> > experiment.
> >
> > This is the correct log file for the truthy dataset.
> >
> > http://www.lotico.com/temp/LOG-98085
> >
> > On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <marco.neum...@gmail.com>
> > wrote:
> >
> > > Thank you Øyvind for sharing, great to see more tests in the wild.
> > >
> > > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > > dataset and quickly ran out of disk space. It finished the job but did
> > > not write any of the indexes to disk due to lack of space. No error
> > > messages.
> > >
> > > http://www.lotico.com/temp/LOG-95239
> > >
> > > I have now ordered a new 4TB SSD drive to rerun the test, possibly with
> > > the full wikidata dataset.
> > >
> > > I personally had the best experience with dedicated hardware so far
> > > (can be in the data center); shared or dedicated virtual compute
> > > engines did not deliver as expected. And I have not seen great benefits
> > > from data center grade multicore CPUs. But I think they will during
> > > runtime in multi-user settings (e.g. fuseki).
> > >
> > > Best,
> > > Marco
> > >
> > > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com> wrote:
> > >
> > >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> > >> truthy dump downloaded 2021-12-09.
> > >>
> > >> The instance is a vm created on the Norwegian Research and Education
> > >> Cloud, an openstack cloud provider.
> > >>
> > >> Instance type:
> > >> 32 GB memory
> > >> 4 CPU
> > >>
> > >> The storage used for dump + temp files is mounted as a separate 900GB
> > >> volume and is mounted on /var/fuseki/databases. The type of storage is
> > >> described as
> > >>
> > >> *mass-storage-default*: Storage backed by spinning hard drives,
> > >> available to everybody and is the default type.
> > >>
> > >> with ext4 configured. At the moment I don't have access to the faster
> > >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> > >> can be overcommitted.
> > >>
> > >> OS for the instance is a clean Rocky Linux image, with no services
> > >> except jena/fuseki installed. The systemd service set up for fuseki is
> > >> stopped. jena and fuseki version is 4.3.0.
> > >>
> > >> openjdk 11.0.13 2021-10-19 LTS
> > >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> > >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> > >>
> > >> I'm running from a tmux session to avoid connectivity issues and to
> > >> capture the output.
> >> I think the output is stored in memory and not on disk.
> >>
> >> On the first run I tried to have the tmpdir on the root partition, to
> >> separate temp dir and data dir, but with only 19 GB free, the tmpdir
> >> soon was disk full. For the second (current) run all directories are
> >> under /var/fuseki/databases.
> >>
> >> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage, where it has really
> >> slowed down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms. It is unclear to me what is
> >> happening in the 1h13m between these lines:
> >>
> >> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >>
> >> * The ingest data step really slows down: at the current rate, if I
> >> calculated correctly, it looks like PKG.CmdxIngestData has 10 days left
> >> before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> >> job, I looked at the `sort` command. I noticed from some online sources
> >> that setting the environment variable LC_ALL=C improves speed for
> >> `sort`. Could this be set on the ProcessBuilder for the `sort` process?
> >> Could it break/change something?
> >> I see the warning from the man page for `sort`:
> >>
> >>   *** WARNING *** The locale specified by the environment affects
> >>   sort order. Set LC_ALL=C to get the traditional sort order that
> >>   uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind

--

---
Marco Neumann
KONA
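For the "10 days left" back-of-envelope estimate in the quoted notes: the Terms parse line reports ~6.56 billion triples total and the Data line shows 502 million ingested, so the remaining time at the logged rates works out roughly as below (a sketch only; the real rate varies over the run, and the thread shows it degrading):

```python
def eta_days(total: int, loaded: int, rate_tps: float) -> float:
    """Days remaining at a constant triples-per-second ingest rate."""
    return (total - loaded) / rate_tps / 86_400  # 86,400 seconds per day

total, loaded = 6_560_468_631, 502_000_000   # figures from the quoted logs
print(round(eta_days(total, loaded, 7593), 1))  # running average -> ~9.2 days
print(round(eta_days(total, loaded, 3356), 1))  # recent batch rate -> ~20.9 days
```

The running average gives roughly the "10 days" quoted; the recent batch rate suggests it could be double that if the slowdown continues.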
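On the LC_ALL=C question: the effect can be demonstrated outside Jena. The Python subprocess call below is the analogue of putting LC_ALL=C into a Java ProcessBuilder environment (an illustration of the locale effect on GNU sort, not Jena's actual code):

```python
import os
import subprocess

# Run GNU sort with the C locale: plain byte-value order, no locale collation.
# This mirrors what setting LC_ALL=C in the spawned process's env would do.
env = dict(os.environ, LC_ALL="C")
data = "b\nB\na\nA\n"
out = subprocess.run(["sort"], input=data, env=env,
                     capture_output=True, text=True, check=True).stdout
print(out.split())  # -> ['A', 'B', 'a', 'b'] (all uppercase first: byte order)
```

Under a UTF-8 locale, `sort` typically interleaves cases (`A a B b`), which is exactly why the man page warns that the locale affects sort order. Whether xloader's merge steps depend on a particular collation, or only on the order being internally consistent, is the open question raised in the thread.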