Hi Lorenz,
On 18/12/2021 08:09, LB wrote:
> Good morning,
>
> loading of Wikidata truthy is done; this time I didn't forget to keep
> the logs:
>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time:
> 31h vs 39h.
Great!
> Not sure if a) there was something else on the server last time
> (at least I couldn't see any running tasks) or b) this is a
> consequence of the more parallelized Unix sort now - I set it to
> --parallel=16.
>
> I mean, the piped input stream is single-threaded, I guess, but maybe
> the sort merge step can benefit from more threads?
Yes - the sorting itself can be more parallel on a machine the size of
yours.
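Roughly this shape of pipeline (illustrative only: "generate-triples"
stands in for the actual producer stage, and the flag values are
examples, not what the loader necessarily passes):

  # Sketch: the pipe into sort is consumed single-threaded, but GNU
  # sort can still parallelize the in-memory sorting and the merges.
  generate-triples | \
    sort --parallel=16 --buffer-size=4G -T /data/tmp > spo.sorted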
Time to add a configuration file, rather than a slew of command line
arguments. The file also then acts as a record of the setup.
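Something like this sketch (key names invented purely for
illustration; not an existing format):

  # xloader.cfg -- hypothetical; all names are made up
  data=/data/wikidata-truthy.nt.gz
  tmpdir=/data/tmp
  sort.parallel=16
  sort.batch-size=128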
I'm finding a new characteristic:
When loading on a smaller machine (32G RAM), I think the index sorting
is recombining temp files. That results in more I/O and higher peak
disk usage. While POS is always slower than SPO, it appears to be very
much slower here.
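One way to watch for the recombining (assuming sort's temp directory
is known; /data/tmp here is a made-up path):

  # Temp-file count dropping while total size stays high suggests
  # sort is doing intermediate merge passes (extra I/O).
  watch -n 10 'ls /data/tmp | wc -l; du -sh /data/tmp'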
The internet has not been very clear on the effect of "batch size",
but the GNU man page talks about "--batch-size=16". I get more than 16
temp files - you probably don't at this scale.

--batch-size=128 seems better -- unlikely to be a problem with the
number of file descriptors nowadays. 16 is probably just how it always
was.
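In sort terms, something like (file names made up):

  # Merge up to 128 temp files in one pass; with, say, 100 temp files
  # that avoids the intermediate recombining (and its extra I/O).
  sort --parallel=16 --batch-size=128 -T /data/tmp spo.raw > spo.sorted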
On my machine, per process:

  ulimit -Sn is 1024      -- current soft limit
  ulimit -Hn is 1048576   -- hard limit (max without being root)
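The soft limit can be raised up to the hard limit without root:

  # Raise the open-file soft limit for the current shell (and children)
  ulimit -Sn 1048576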
I'll investigate when the load finishes. I'm trying not to touch the
machine to avoid breaking something. It is currently doing OSP.
> I guess I have to clean up everything and run it again with the
> original setup with 2 Unix sort threads ...
Andy