Hi Marco,

Very useful to compare with your log from the different runs. I'm still
working on the configuration to see if I can get the ingest data stage to be
usable on HDD. It looks like I get close to the performance of your run on
the earlier stages, while ingest data is still far too slow. Perhaps SSD is
simply necessary for a real-world large import to complete? I'll request
some SSD storage as well, and hope there's quota for me :)

Maybe I could also test different distros, to see if some of the default OS
settings affect the import.

Best regards,
Øyvind

søn. 12. des. 2021 kl. 10:21 skrev Marco Neumann <marco.neum...@gmail.com>:

> Øyvind, looks like the above was the wrong log from a prior sharding
> experiment.
>
> This is the correct log file for the truthy dataset.
>
> http://www.lotico.com/temp/LOG-98085
>
>
>
> On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <marco.neum...@gmail.com>
> wrote:
>
> > Thank you Øyvind for sharing, great to see more tests in the wild.
> >
> > I did the test with a 1TB SSD / RAID1 / 64GB / Ubuntu and the truthy
> > dataset and quickly ran out of disk space. It finished the job but did
> > not write any of the indexes to disk due to lack of space. No error
> > messages.
> >
> > http://www.lotico.com/temp/LOG-95239
> >
> > I have now ordered a new 4TB SSD drive to rerun the test, possibly with
> > the full Wikidata dataset.
> >
> > I personally had the best experience with dedicated hardware so far (it
> > can be in the data center); shared or dedicated virtual compute engines
> > did not deliver as expected. And I have not seen great benefits from
> > data-center-grade multicore CPUs, but I think they will help at runtime
> > in multi-user settings (e.g. Fuseki).
> >
> > Best,
> > Marco
> >
> > On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com>
> > wrote:
> >
> >> I'm trying out tdb2.xloader on an OpenStack VM, loading the Wikidata
> >> truthy dump downloaded 2021-12-09.
> >>
> >> The instance is a VM created on the Norwegian Research and Education
> >> Cloud, an OpenStack cloud provider.
> >>
> >> Instance type:
> >> 32 GB memory
> >> 4 CPU
> >>
> >> The storage used for the dump + temp files is a separate 900GB volume
> >> mounted on /var/fuseki/databases. The type of storage is described as
> >> >  *mass-storage-default*: Storage backed by spinning hard drives,
> >> > available to everybody and is the default type.
> >> and is formatted with ext4. At the moment I don't have access to the
> >> faster volume type mass-storage-ssd. CPU and memory are not dedicated,
> >> and can be overcommitted.
> >>
> >> OS for the instance is a clean Rocky Linux image, with no services
> >> except Jena/Fuseki installed. The systemd service set up for Fuseki is
> >> stopped. The Jena and Fuseki version is 4.3.0.
> >>
> >> openjdk 11.0.13 2021-10-19 LTS
> >> OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >> OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >>
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture the output. I think the output is stored in memory and not on
> >> disk. On the first run I tried to have the tmpdir on the root
> >> partition, to separate the temp dir and data dir, but with only 19 GB
> >> free, the tmpdir soon filled the disk. For the second (current) run
> >> all directories are under /var/fuseki/databases.
> >>
> >>  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
> >> --tmpdir
> >> /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage, where it has really
> >> slowed down.
> >>
> >> Current output is:
> >>
> >> 20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
> >> Avg: 7,593)
> >>
> >> See full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >>
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms. It is unclear to me what
> >> is happening in the 1 hour 13 minutes between these lines:
> >>
> >> 22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds
> >> [2021/12/10 22:33:46 CET]
> >> 22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
> >> 6,560,468,631 triples/quads 129,331 TPS
> >> 23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
> >> Avg: 237,755)
> >>
> >> * The import really slows down at the "ingest data" stage: at the
> >> current rate, if I calculated correctly, it looks like
> >> PKG.CmdxIngestData has about 10 days left before it finishes.
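> >> (Rough check, using the numbers from the log above: 6,560,468,631
> >> triples total minus the 502,000,000 already ingested, divided by the
> >> average of 7,593/s, is roughly 800,000 seconds, i.e. a little over 9
> >> days, assuming the rate does not drop further.)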
> >>
> >> * When I saw sort running in the background for the first parts of
> >> the job, I looked at the `sort` command. I noticed from some online
> >> sources that setting the environment variable LC_ALL=C improves the
> >> speed of `sort`. Could this be set on the ProcessBuilder for the
> >> `sort` process (see the sketch after the links below)? Could it
> >> break/change something? I see the warning from the man page for
> >> `sort`.
> >>
> >>        *** WARNING *** The locale specified by the environment affects
> >>        sort order.  Set LC_ALL=C to get the traditional sort order that
> >>        uses native byte values.
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
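> >>
> >> To make the ProcessBuilder question concrete, here is a minimal sketch
> >> of what I have in mind. It is only an illustration: I have not checked
> >> how tdb2.xloader actually launches `sort`, and the file names and the
> >> --parallel value here are made up.
> >>
> >>   import java.io.IOException;
> >>
> >>   public class SortWithCLocale {
> >>       public static void main(String[] args) throws IOException, InterruptedException {
> >>           // Hypothetical command line; xloader's real invocation may differ.
> >>           ProcessBuilder pb = new ProcessBuilder(
> >>                   "sort", "--parallel=4", "-o", "sorted.tmp", "input.tmp");
> >>           // Force byte-value collation instead of locale-aware collation.
> >>           pb.environment().put("LC_ALL", "C");
> >>           // Show sort's stderr/stdout in our own console for debugging.
> >>           pb.redirectErrorStream(true);
> >>           pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);
> >>           Process p = pb.start();
> >>           int exit = p.waitFor();
> >>           System.out.println("sort exited with status " + exit);
> >>       }
> >>   }
> >>
> >> As long as the sorted temp files only need a consistent byte-wise
> >> ordering (not a human-readable one), LC_ALL=C should only change the
> >> speed and not the result, but that is exactly the part I'm unsure
> >> about.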
> >>
> >> Best regards,
> >> Øyvind
> >>
> >
> >
> > --
> >
> >
> > ---
> > Marco Neumann
> > KONA
> >
> >
>
> --
>
>
> ---
> Marco Neumann
> KONA
>
