Good morning Lorenz,

Maybe it's time to run a few query benchmark tests? :)

What does tdb2.tdbstats report?
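Something along these lines should do it, assuming the usual --loc option
of the TDB2 command line tools (with the path adjusted to your database
directory):

    tdb2.tdbstats --loc=/path/to/wikidata-db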

Marco


On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid>
wrote:

> Good morning,
>
> Loading of Wikidata truthy is done; this time I didn't forget to keep the
> logs:
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> I'm a bit surprised that this time it was 8h faster than last time, 31h
> vs. 39h. I'm not sure whether a) something else was running on the server
> last time (at least I couldn't see any running tasks) or b) this is a
> consequence of the more parallelized Unix sort - I set it to
> --parallel=16 this time.
>
> The piped input stream is presumably single-threaded, but maybe the merge
> step of the sort can still benefit from more threads? I suppose I'll have
> to clean everything up and run it again with the original setup of 2 Unix
> sort threads ...
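>
> For illustration, what I mean is the difference between
>
>     ... | sort --parallel=2  -S 16G -T /data/tmp
>     ... | sort --parallel=16 -S 16G -T /data/tmp
>
> (buffer size and temp directory are just placeholder values here, not
> what the loader actually uses) - the piped input is the same in both
> cases, so any gain would have to come from sort's internal chunk sorting
> and merging.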
>
>
> On 16.12.21 14:48, Andy Seaborne wrote:
> >
> >
> > On 16/12/2021 10:52, Andy Seaborne wrote:
> > ...
> >
> >> I am seeing a slowdown during data ingestion. However, your summary
> >> figures don't show that in the ingest phase. The full logs may contain
> >> the signal, but less pronounced.
> >>
> >> My working assumption is now that it is random access to the node
> >> table. Your results point to it not being a CPU issue but to my setup
> >> saturating the I/O path. While the portable has an NVMe SSD, it
> >> probably does not have the same I/O bandwidth as a server-class
> >> machine.
> >>
> >> I'm not sure what to do about this other than running with a much
> >> bigger node table cache for the ingestion phase. Trading some of the
> >> memory-mapped file space for a bigger cache should be a win. While I
> >> hadn't noticed it before, it is probably visible in the logs of
> >> smaller loads on closer inspection. Experimenting on a small dataset
> >> is a lot easier.
> >
> > I'm more sure of this - not yet definite.
> >
> > The nodeToNodeId cache is 200k -- this is the one on the load/update
> > path. That seems rather small for the task.
> >
> > The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> > results.
> >
> > Two pieces of data will help:
> >
> > Experimenting with very small cache settings.
> >
> > Letting my slow load keep going, to see whether the same characteristic
> > shows up at the index stage. It shouldn't if nodeToNodeId is the cause;
> > that cache only has an influence in the data ingestion step.
> >
> > Aside: increasing nodeToNodeId could also help tdbloader=parallel and
> > maybe loader=phased. It falls into the same situation, although the
> > improvement there is going to be less marked. "Parallel" saturates the
> > I/O by other means as well.
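> >
> > (For reference, that is the loader selected on the command line with
> > something like the following -- assuming I have the option names right;
> > the data file is a placeholder:
> >
> >     tdb2.tdbloader --loader=parallel --loc=DB data.nt.gz
> >
> > as opposed to --loader=phased.)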
> >
> >     Andy
>


--
Marco Neumann
KONA
