Hi Lorenz,

On 18/12/2021 08:09, LB wrote:
> Good morning,
>
> loading of Wikidata truthy is done; this time I didn't forget to keep logs: https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time: 31h vs 39h.

Great!

> Not sure if a) there was something else running on the server last time (at least I couldn't see any running tasks) or b) this is a consequence of the more parallelized Unix sort now - I set it to --parallel=16.
>
> I mean, the piped input stream is single-threaded I guess, but maybe the sort merge step can benefit from more threads?

Yes - the sorting itself can be more parallel on a machine the size of yours.
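
For reference, a minimal sketch of the kind of pipeline in question, assuming GNU coreutils sort (the file names, buffer size and temp directory are made up):

    # The process feeding the pipe is single-threaded, but GNU sort can
    # still parallelize the in-memory sorting of each chunk it buffers.
    gzip -dc triples.tsv.gz | sort --parallel=16 -S 8G -T /mnt/scratch -o sorted.tsv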

Time to add a configuration file rather than a slew of command-line arguments. The file then also acts as a record of the setup.
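
Nothing like this exists yet, but to sketch the idea - a properties-style file in which every key name is hypothetical, loosely mirroring the current command-line options:

    ## xloader.cfg (hypothetical - none of these keys exist today)
    loc=/data/wikidata-tdb2        # database location
    tmpdir=/mnt/scratch/xloader    # workspace for temporary files
    sort.parallel=16               # passed through to sort --parallel
    sort.batch-size=128            # passed through to sort --batch-size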


I'm finding a new characteristic:

Loading on a smaller machine (32G RAM), I think the index sorting is recombining temp files in intermediate merge rounds. That results in more I/O and higher peak disk usage. POS is always slower than SPO, but here it appears to be very much slower.

The internet has not been very clear on the effect of "batch size", but the GNU man page documents --batch-size, whose default is 16. I get more than 16 temp files - you probably don't on a machine your size.

--batch-size=128 seems better -- unlikely to be a problem with the number of file descriptors nowadays. 16 is probably just how it always was.
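
To make that concrete (paths invented): with the default batch size of 16, a run that produces, say, 100 temp files forces intermediate merge rounds, re-reading and re-writing the data; raising the limit to 128 inputs per merge keeps it to one final pass, at the cost of one open file descriptor per input.

    # One final many-way merge instead of intermediate merge rounds.
    sort --parallel=16 --batch-size=128 -T /mnt/scratch -o pos.sorted pos.unsorted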

On my machine, per process:

    ulimit -Sn is 1024      -- current soft limit
    ulimit -Hn is 1048576   -- hard limit; the max without being root
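
If a bigger --batch-size ever does bump into the soft limit, it can be raised for the shell running the loader, without root, up to the hard limit:

    ulimit -Sn        # show the current soft limit (1024 here)
    ulimit -n 4096    # raise the soft limit for this shell and its children
    ulimit -Hn        # the ceiling reachable without root (1048576 here)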

I'll investigate when the load finishes. I'm trying not to touch the machine to avoid breaking something. It is currently doing OSP.

> I guess I have to clean up everything and run it again with the original setup with 2 Unix sort threads ...

    Andy
