Thank you, Lorenz. Unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on GitHub.
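Side note on the per-class P31 stats rules Lorenz describes below: the counts could
also be pulled straight from the TDB2 store with Jena's Java API and printed as
candidate rules. A minimal sketch, assuming a placeholder database path and my
reading of the full-pattern rule form ((VAR p o) count) from the TDB optimizer
docs - worth double-checking against the generated stats.opt before pasting the
lines in front of the generic P31 rule:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class P31StatsRules {
    public static void main(String[] args) {
        // Placeholder location of the TDB2 database built by tdb2.xloader.
        String location = args.length > 0 ? args[0] : "/data/wikidata-truthy-tdb2";
        String query = "SELECT ?c (COUNT(*) AS ?cnt) "
                + "WHERE { ?s <http://www.wikidata.org/prop/direct/P31> ?c } GROUP BY ?c";
        Dataset ds = TDB2Factory.connectDataset(location);
        // TDB2 needs a transaction even for read-only queries.
        Txn.executeRead(ds, () -> {
            try (QueryExecution qe = QueryExecutionFactory.create(query, ds)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    // One candidate rule per class; the ((VAR p o) count) pattern is my
                    // reading of the stats rule language and should be verified against
                    // the TDB optimizer documentation before use.
                    System.out.printf("((VAR <http://www.wikidata.org/prop/direct/P31> <%s>) %d)%n",
                            row.getResource("c").getURI(),
                            row.getLiteral("cnt").getLong());
                }
            }
        });
        ds.close();
    }
}

Run with the database directory as the only argument; it prints one rule per class,
to be merged into the stats file by hand ahead of the generic P31 rule.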
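And on the negative (count -1983667112) below: that looks exactly like a 32-bit
overflow. If the true triple count were around 6.6 billion (an assumption on my
side, not a figure taken from the logs), narrowing it to a signed int reproduces
that value - a quick check:

public class CountOverflowCheck {
    public static void main(String[] args) {
        // Hypothetical true count of ~6.6 billion triples - an assumption,
        // not a number read from the load logs.
        long assumedCount = 6_606_267_480L;
        // Narrowing to a signed 32-bit int wraps around and reproduces the
        // negative value seen in the stats file.
        int truncated = (int) assumedCount;
        System.out.println(truncated);   // prints -1983667112
    }
}

which would fit the overflow theory reported in JENA-2225.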
On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com.invalid> wrote:
> I edited the Gist [1] and put the default stats there. Takes ~4min to
> compute the stats.
>
> Findings:
>
> - for Wikidata we have to extend those stats with the stats for the
> wdt:P31 property, as Wikidata uses this property as its own rdf:type
> relation. It is indeed trivial, just execute
>
> select ?c (count(*) as ?cnt) {?s
> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>
> and convert it into the stats rule language (SSE) and put those rules
> before the more generic rule
>
> (<http://www.wikidata.org/prop/direct/P31> 98152611)
>
> - I didn't want to touch the stats script itself, but we could for
> example also make this type relation generic and allow for others like
> wdt:P31 or skos:subject via a command-line option which provides any URI
> as the type relation, with the default being rdf:type - but that's
> probably overkill
>
> - there is a bug in the stats script or file, I guess because of some
> overflow? The count value is
>
> (count -1983667112)
>
> which indicates this. I opened a ticket:
> https://issues.apache.org/jira/browse/JENA-2225
>
>
> [1]
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> On 18.12.21 11:35, Marco Neumann wrote:
> > good morning Lorenz,
> >
> > Maybe time to get a few query benchmark tests? :)
> >
> > What does tdb2.tdbstats report?
> >
> > Marco
> >
> >
> > On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> Good morning,
> >>
> >> loading of Wikidata truthy is done, this time I didn't forget to keep
> >> logs:
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> I'm a bit surprised that this time it was 8h faster than last time, 31h
> >> vs 39h. Not sure if a) there was something else on the server last time
> >> (at least I couldn't see any running tasks) or b) if this is a
> >> consequence of the more parallelized Unix sort now - I set it to
> >> --parallel=16
> >>
> >> I mean, the piped input stream is single-threaded I guess, but maybe the
> >> sort merge step can benefit from more threads? I guess I have to clean
> >> up everything and run it again with the original setup with 2 Unix sort
> >> threads ...
> >>
> >>
> >> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>
> >>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>> ...
> >>>
> >>>> I am getting a slowdown during data ingestion. However, your summary
> >>>> figures don't show that in the ingest phase. The whole logs may have
> >>>> the signal in them, but less pronounced.
> >>>>
> >>>> My working assumption is now that it is random access to the node
> >>>> table. Your results point to it not being a CPU issue but that my
> >>>> setup is saturating the I/O path. While the portable has an NVMe SSD,
> >>>> it has probably not got the same I/O bandwidth as a server-class
> >>>> machine.
> >>>>
> >>>> I'm not sure what to do about this other than run with a much bigger
> >>>> node table cache for the ingestion phase. Substituting some of the
> >>>> memory-mapped file area for a bigger cache should be a win. While I
> >>>> hadn't noticed it before, it is probably visible in logs of smaller
> >>>> loads on closer inspection. Experimenting on a small dataset is a lot
> >>>> easier.
> >>> I'm more sure of this - not yet definite.
> >>>
> >>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>> Seems rather small for the task.
> >>>
> >>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> >>> results.
> >>>
> >>> 2 pieces of data will help:
> >>>
> >>> Experimenting with very small cache settings.
> >>>
> >>> Letting my slow load keep going to see if there are the same
> >>> characteristics at the index stage. There shouldn't be if nodeToNodeId
> >>> is the cause; it's only an influence in the data ingestion step.
> >>>
> >>> Aside: Increasing nodeToNodeId could also help tdbloader=parallel and
> >>> maybe loader=phased. It falls into the same situation, although the
> >>> improvement there is going to be less marked. "Parallel" saturates the
> >>> I/O by other means as well.
> >>>
> >>> Andy
>
>

--
---
Marco Neumann
KONA