Good morning Lorenz,

Maybe time to run a few query benchmark tests? :)
What does tdb2.tdbstats report?

Marco

On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid> wrote:

> Good morning,
>
> loading of Wikidata truthy is done, this time I didn't forget to keep
> logs:
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time, 31h
> vs 39h. Not sure if a) there was something else on the server last time
> (at least I couldn't see any running tasks) or b) if this is a
> consequence of the more parallelized Unix sort now - I set it to
> --parallel=16
>
> I mean, the piped input stream is single threaded I guess, but maybe the
> sort merge step can benefit from more threads? I guess I have to clean
> up everything and run it again with the original setup with 2 Unix sort
> threads ...
>
>
> On 16.12.21 14:48, Andy Seaborne wrote:
> >
> >
> > On 16/12/2021 10:52, Andy Seaborne wrote:
> > ...
> >
> >> I am getting a slow down during data ingestion. However, your summary
> >> figures don't show that in the ingest phase. The whole logs may have
> >> the signal in it but less pronounced.
> >>
> >> My working assumption is now that it is random access to the node
> >> table. Your results point to it not being a CPU issue but that my
> >> setup is saturating the I/O path. While the portable has a NVMe SSD,
> >> it has probably not got the same I/O bandwidth as a server class
> >> machine.
> >>
> >> I'm not sure what to do about this other than run with a much bigger
> >> node table cache for the ingestion phase. Substituting some file
> >> mapper file area for bigger cache should be a win. While I hadn't
> >> noticed before, it is probably visible in logs of smaller loads on
> >> closer inspection. Experimenting on a small dataset is a lot easier.
> >
> > I'm more sure of this - not yet definite.
> >
> > The nodeToNodeId cache is 200k -- this is on the load/update path.
> > Seems rather small for the task.
> >
> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> > results.
> >
> > 2 pieces of data will help:
> >
> > Experimenting with very small cache settings.
> >
> > Letting my slow load keep going to see if there is the same
> > characteristics at the index stage. There shouldn't be if nodeToNodeId
> > is the cause; it's only an influence in the data ingestion step.
> >
> > Aside : Increasing nodeToNodeId could also help tdbloader=parallel and
> > maybe loader=phased. It falls into the same situation although the
> > improvement there is going to be less marked. "Parallel" saturates the
> > I/O by other means as well.
> >
> > Andy
>

--

---
Marco Neumann
KONA