Thank you, Lorenz. Can you please post a directory listing for Data-0001 with file sizes?
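Something like this would capture it (a sketch, assuming the default TDB2 layout where the Data-0001 generation sits directly under the database directory):

    # human-readable sizes for every file in the database's Data-0001 directory
    ls -lh datasets/wikidata-tdb/Data-0001

    # or, sorted by size
    du -ah datasets/wikidata-tdb/Data-0001 | sort -h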
On Thu, Dec 16, 2021 at 8:49 AM LB <conpcompl...@googlemail.com.invalid> wrote:

> Loading of the latest WD truthy dump (6.6 billion triples), Bzip2 compressed:
>
> Server:
>
>   AMD Ryzen 9 5950X (16C/32T)
>   128 GB DDR4 ECC RAM
>   2 x 3.84 TB NVMe SSD
>
> Environment:
>
>   - Ubuntu 20.04.3 LTS
>   - OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
>   - Jena 4.3.1
>
> Command:
>
>   tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb \
>       --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2
>
> Log summary:
>
>   04:14:28 INFO  Load node table  = 36600 seconds
>   04:14:28 INFO  Load ingest data = 25811 seconds
>   04:14:28 INFO  Build index SPO  = 20688 seconds
>   04:14:28 INFO  Build index POS  = 35466 seconds
>   04:14:28 INFO  Build index OSP  = 25042 seconds
>   04:14:28 INFO  Overall 143607 seconds
>   04:14:28 INFO  Overall 39h 53m 27s
>   04:14:28 INFO  Triples loaded = 6.610.055.778
>   04:14:28 INFO  Quads loaded = 0
>   04:14:28 INFO  Overall Rate 46.028 tuples per second
>
> Disk space usage according to
>
>   du -sh datasets/wikidata-tdb
>
> is
>
>   524G datasets/wikidata-tdb
>
> During loading I could see ~90 GB of RAM occupied (50% of total memory
> goes to sort, and it used 2 threads - is it intended to stick to 2
> threads with --parallel 2?)
>
> Cheers,
> Lorenz
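(On the --parallel question: per Andy's explanation further down, the external sort is run with a 50% memory buffer and two threads. A rough hand-run equivalent, as a sketch only - the exact invocation lives in the xloader scripts, and the file names here are hypothetical:)

    # Approximate the sort settings the loader uses.
    # LC_ALL=C forces plain byte-value ordering -- see the locale discussion below.
    export LC_ALL=C
    sort --buffer-size=50% --parallel=2 -T /data/tmp/tdb -o sorted.tmp triples.tmp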
> On 12.12.21 13:07, Andy Seaborne wrote:
> > Hi, Øyvind,
> >
> > This is all very helpful feedback. Thank you.
> >
> > On 11/12/2021 21:45, Øyvind Gjesdal wrote:
> >> I'm trying out tdb2.xloader on an openstack vm, loading the wikidata
> >> truthy dump downloaded 2021-12-09.
> >
> > This is the 4.3.0 xloader?
> >
> > There are improvements in 4.3.1: since that release was going out,
> > the development version, which uses less temporary space, got merged
> > in. It has had some testing.
> >
> > It compresses triples.tmp and the intermediate sort files in the
> > index stage, making the peak usage much smaller.
> >
> >> The instance is a vm created on the Norwegian Research and Education
> >> Cloud, an openstack cloud provider.
> >>
> >> Instance type:
> >>   32 GB memory
> >>   4 CPU
> >
> > I'm using something similar on a 7-year-old desktop machine with a
> > SATA disk.
> >
> > I haven't got a machine I can dedicate to the multi-day load. I'll
> > try to find a way to at least push it through building the node
> > table.
> >
> > Loading the first 1B of truthy:
> >
> >   1B triples, 40k TPS, 06h 54m 10s
> >
> > The database is 81G, and building needs an additional 11.6G of
> > workspace, for a total of 92G (+ the data file).
> >
> > While smaller, bz2 files seem much slower to decompress, so I've
> > been using gz files.
> >
> > My current best guess for 6.4B truthy is
> >
> >   Temp       96G
> >   Database  540G
> >   Data       48G
> >   Total:    684G -- peak disk needed
> >
> > based on scaling up 1B truthy. Personally, I would make sure there
> > was more space. Also - I don't know if the shape of the data is
> > sufficiently uniform to make scaling predictable. The time doesn't
> > scale so simply.
> >
> > This is for the 4.3.1 version - 4.3.0 uses a lot more disk space.
> >
> > Compression reduces the size of triples.tmp -- and of the related
> > sort temporary files, which add up to the same again -- to 1/6.
> >
> >> The storage used for dump + temp files is mounted as a separate
> >> 900GB volume on /var/fuseki/databases. The type of storage is
> >> described as
> >>
> >>   *mass-storage-default*: Storage backed by spinning hard drives,
> >>   available to everybody and is the default type.
> >>
> >> with ext4 configured. At the moment I don't have access to the
> >> faster volume type mass-storage-ssd. CPU and memory are not
> >> dedicated, and can be overcommitted.
> >
> > "overcommitted" may be a problem.
> >
> > While it's not "tdb2 loader parallel", it does use continuous CPU in
> > several threads.
> >
> > For memory - "it's complicated".
> >
> > The java parts only need, say, 2G. The sort is set to "buffer 50%,
> > --parallel=2", and the java process pipes into sort - that's another
> > thread. I think the effective peak is 3 active threads, and they'll
> > all be at 100% for some of the time.
> >
> > So it's going to need 50% of RAM, + 2G for a java process, + OS.
> >
> > It does not need space for memory-mapped files (they aren't used at
> > all in the loading process, and I/O is sequential).
> >
> > If that triggers overcommitment swap-out, the performance may go
> > down a lot.
> >
> > For disk - if that is physically remote, it should not be a problem
> > (famous last words). I/O is sequential and in large continuous
> > chunks - typical for batch processing jobs.
> >
> >> OS for the instance is a clean Rocky Linux image, with no services
> >> except jena/fuseki installed. The systemd service set up for fuseki
> >> is stopped.
> >> jena and fuseki version is 4.3.0.
> >>
> >>   openjdk 11.0.13 2021-10-19 LTS
> >>   OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
> >>   OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
> >
> > Just FYI: Java 17 is a little faster. Some java improvements have
> > raised RDF parsing speed by up to 10%; in xloader that's not
> > significant to the overall time.
> >
> >> I'm running from a tmux session to avoid connectivity issues and to
> >> capture the output.
> >
> > I use
> >
> >   tdb2.xloader .... |& tee LOG-FILE-NAME
> >
> > to capture the logs and see them. ">&" and "tail -f" would achieve
> > much the same effect.
> >
> >> I think the output is stored in memory and not on disk.
> >> On the first run I tried to have the tmpdir on the root partition,
> >> to separate the temp dir and the data dir, but with only 19 GB
> >> free, the tmpdir soon filled the disk. For the second (current) run
> >> all directories are under /var/fuseki/databases.
> >
> > Yes - after making that mistake myself, the new version ignores the
> > system TMPDIR. Using --tmpdir is best, but otherwise it defaults to
> > the data directory.
> >
> >>   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
> >>       --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz
> >>
> >> The import is so far at the "ingest data" stage, where it has
> >> really slowed down.
> >
> > FYI: The first line of ingest is always very slow. It is not
> > measuring the start point correctly.
> >
> >> Current output is:
> >>
> >>   20:03:43 INFO Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
> >>
> >> See the full log so far at
> >> https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
> >
> > The earlier first pass also slows down; it should be a fairly
> > constant-ish speed step once everything settles down.
> >
> >> Some notes:
> >>
> >> * There is a (time/info) lapse in the output log between the end of
> >> 'parse' and the start of 'index' for Terms. It is unclear to me
> >> what is happening in the 1h 13m between the lines:
> >>
> >>   22:33:46 INFO Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
> >>   22:33:52 INFO Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
> >>   23:46:13 INFO Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
> >
> > There is "sort" going on. "top" should show it.
> >
> > For each index there is also a very long pause, for exactly the same
> > reason. It would be good to have something go "tick" and log a
> > message occasionally.
> >
> >> * The ingest data step really slows down: at the current rate, if I
> >> calculated correctly, it looks like PKG.CmdxIngestData has 10 days
> >> left before it finishes.
> >
> > Ouch.
> >
> >> * When I saw sort running in the background for the first parts of
> >> the job, I looked at the `sort` command. I noticed from some online
> >> sources that setting the environment variable LC_ALL=C improves
> >> speed for `sort`. Could this be set on the ProcessBuilder for the
> >> `sort` process? Could it break/change something? I see the warning
> >> from the man page for `sort`:
> >>
> >>   *** WARNING *** The locale specified by the environment affects
> >>   sort order. Set LC_ALL=C to get the traditional sort order that
> >>   uses native byte values.
> >
> > It shouldn't matter but, yes, better to set it and export it in the
> > control script so it propagates to the forked processes.
> >
> > The sort is a binary sort except that, because sort is a text
> > program, the binary is turned into hex (!!). Hex is in the ASCII
> > subset and should be locale safe.
> >
> > But better to set LC_ALL=C.
> >
> >     Andy
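(The locale effect on sort order is easy to see with a two-line experiment:)

    # locale collation (e.g. en_US.UTF-8) case-folds, so "a" sorts before "B"
    printf 'B\na\n' | sort
    # LC_ALL=C compares native byte values, so "B" (0x42) sorts before "a" (0x61)
    printf 'B\na\n' | LC_ALL=C sort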
> >>
> >> Links:
> >> https://access.redhat.com/solutions/445233
> >> https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
> >> https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
> >>
> >> Best regards,
> >> Øyvind

--
---
Marco Neumann
KONA