The more tests we have on different machines the better. :) Personally I'd say if you have a choice go for a PCIe 4.0 NVMe SSD and stay away from SATA SSDs older than SATA III. Also, SSD RAID isn't necessary for the tests.
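(As a quick aside: on Linux the kernel reports whether a block device is backed by spinning or flash media, which is handy when a cloud provider's volume types are vague. A small sketch; the device name is a hypothetical example, and the `sysfs` parameter exists only to make the helper testable:)

```python
from pathlib import Path

def is_rotational(device: str, sysfs: str = "/sys/block") -> bool:
    """True if the kernel reports the block device as spinning media (HDD)."""
    # The kernel exposes "1" for rotating disks and "0" for SSD/NVMe.
    return Path(sysfs, device, "queue", "rotational").read_text().strip() == "1"

# Example (hypothetical device name; adjust to whatever backs your volume):
# is_rotational("sda")
```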
These components have become extremely affordable in recent years and really should be part of a fast pipeline imo, in particular for tdb2.tdbloader in parallel mode. But as Andy has emphasized, he designed the tdb2.xloader process to be spinning-disk friendly, so SSDs are not a prerequisite for xloader.

On Tue, Dec 14, 2021 at 10:38 AM Øyvind Gjesdal <oyvin...@gmail.com> wrote:

> Hi Marco,
>
> Very useful to compare with your log on the different runs. Still working
> with configuration to see if I can get the ingest data stage to be usable
> for hdd. It looks like I get close to the performance of your run on the
> earlier stages, while ingest data is still very much too slow. Having to
> use SSD may be necessary for a real world large import to complete? I'll
> request some ssd storage as well, and hope there's a quota for me :)
>
> Maybe I could also test different distros, to see if some of the default
> OS settings affect the import.
>
> Best regards,
> Øyvind
>
> On Sun, Dec 12, 2021 at 10:21 Marco Neumann <marco.neum...@gmail.com> wrote:
>
> > Øyvind, looks like the above was the wrong log from a prior sharding
> > experiment.
> >
> > This is the correct log file for the truthy dataset.
> >
> > http://www.lotico.com/temp/LOG-98085
> >
> > On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <marco.neum...@gmail.com>
> > wrote:
> >
> > > Thank you Øyvind for sharing, great to see more tests in the wild.
> > >
> > > I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
> > > dataset and quickly ran out of disk space. It finished the job but did
> > > not write any of the indexes to disk due to lack of space. No error
> > > messages.
> > >
> > > http://www.lotico.com/temp/LOG-95239
> > >
> > > I have now ordered a new 4TB SSD drive to rerun the test, possibly with
> > > the full wikidata dataset.
> > >
> > > I personally had the best experience with dedicated hardware so far
> > > (can be in the data center); shared or dedicated virtual compute
> > > engines did not deliver as expected. And I have not seen great benefits
> > > from data center grade multicore CPUs. But I think they will during
> > > runtime in multi-user settings (e.g. fuseki).
> > >
> > > Best,
> > > Marco
> > >
> > > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com> wrote:
> > >
> > >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> > >> truthy dump downloaded 2021-12-09.
> > >>
> > >> The instance is a vm created on the Norwegian Research and Education
> > >> Cloud, an openstack cloud provider.
> > >>
> > >> Instance type:
> > >> 32 GB memory
> > >> 4 CPU
> > >>
> > >> The storage used for dump + temp files is mounted as a separate 900GB
> > >> volume and is mounted on /var/fuseki/databases. The type of storage is
> > >> described as
> > >>
> > >> *mass-storage-default*: Storage backed by spinning hard drives,
> > >> available to everybody and is the default type.
> > >>
> > >> with ext4 configured. At the moment I don't have access to the faster
> > >> volume type mass-storage-ssd. CPU and memory are not dedicated, and
> > >> can be overcommitted.
> > >>
> > >> OS for the instance is a clean Rocky Linux image, with no services
> > >> except jena/fuseki installed. The systemd service set up for fuseki is
> > >> stopped. jena and fuseki version is 4.3.0.
> > >>
> > >> openjdk 11.0.13 2021-10-19 LTS
> > >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> > >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> > >>
> > >> I'm running from a tmux session to avoid connectivity issues and to
> > >> capture the output.
> >> I think the output is stored in memory and not on disk.
> >>
> >> On the first run I tried to have the tmpdir on the root partition, to
> >> separate temp dir and data dir, but with only 19 GB free, the tmpdir
> >> soon was disk full. For the second (current) run all directories are
> >> under /var/fuseki/databases.
> >>
> >> $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage, where it has really
> >> slowed down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms. It is unclear to me what is
> >> happening in the 1h13m between these lines:
> >>
> >> 22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >> 22:33:52 INFO Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >>
> >> * The ingest data step really slows down: at the current rate, if I
> >> calculated correctly, it looks like PKG.CmdxIngestData has 10 days left
> >> before it finishes.
> >>
> >> * When I saw sort running in the background for the first parts of the
> >> job, I looked at the `sort` command. I noticed from some online sources
> >> that setting the environment variable LC_ALL=C improves speed for
> >> `sort`. Could this be set on the ProcessBuilder for the `sort` process?
> >> Could it break/change something?
> >> I see the warning from the man page for `sort`:
> >>
> >>   *** WARNING *** The locale specified by the environment affects
> >>   sort order. Set LC_ALL=C to get the traditional sort order that
> >>   uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind

--

---
Marco Neumann
KONA
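For the "10 days left" back-of-envelope estimate in the quoted notes: the Terms parse line reports ~6.56 billion triples total and the Data line shows 502 million ingested, so the remaining time at the logged rates works out roughly as below (a sketch only; the real rate varies over the run, and the thread shows it degrading):

```python
def eta_days(total: int, loaded: int, rate_tps: float) -> float:
    """Days remaining at a constant triples-per-second ingest rate."""
    return (total - loaded) / rate_tps / 86_400  # 86,400 seconds per day

total, loaded = 6_560_468_631, 502_000_000   # figures from the quoted logs
print(round(eta_days(total, loaded, 7593), 1))  # running average -> ~9.2 days
print(round(eta_days(total, loaded, 3356), 1))  # recent batch rate -> ~20.9 days
```

The running average gives roughly the "10 days" quoted; the recent batch rate suggests it could be double that if the slowdown continues.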
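On the LC_ALL=C question: the effect can be demonstrated outside Jena. The Python subprocess call below is the analogue of putting LC_ALL=C into a Java ProcessBuilder environment (an illustration of the locale effect on GNU sort, not Jena's actual code):

```python
import os
import subprocess

# Run GNU sort with the C locale: plain byte-value order, no locale collation.
# This mirrors what setting LC_ALL=C in the spawned process's env would do.
env = dict(os.environ, LC_ALL="C")
data = "b\nB\na\nA\n"
out = subprocess.run(["sort"], input=data, env=env,
                     capture_output=True, text=True, check=True).stdout
print(out.split())  # -> ['A', 'B', 'a', 'b'] (all uppercase first: byte order)
```

Under a UTF-8 locale, `sort` typically interleaves cases (`A a B b`), which is exactly why the man page warns that the locale affects sort order. Whether xloader's merge steps depend on a particular collation, or only on the order being internally consistent, is the open question raised in the thread.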