Thank you Andy. Found it in the revisions somewhere. I just finished another run with truthy:
http://lotico.com/temp/LOG-1214

I will now increase the RAM before running an additional load with an
increased thread count.

Marco

On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <a...@apache.org> wrote:

> Gists are git repos, so the file is there ... somewhere:
>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>
>     Andy
>
> On 19/12/2021 17:56, Marco Neumann wrote:
> > Thank you Lorenz,
> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
> > GitHub.
> >
> > On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> I edited the Gist [1] and put the default stats there. It takes ~4 min
> >> to compute the stats.
> >>
> >> Findings:
> >>
> >> - For Wikidata we have to extend those stats with stats for the wdt:P31
> >> property, as Wikidata uses this property as its own rdf:type relation.
> >> It is indeed trivial: just execute
> >>
> >>   select ?c (count(*) as ?cnt) {?s <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
> >>
> >> convert the result into the stats rule language (SSE), and put those
> >> rules before the more generic rule
> >>
> >>   (<http://www.wikidata.org/prop/direct/P31> 98152611)
> >>
> >> - I didn't want to touch the stats script itself, but we could, for
> >> example, also make this type relation generic and allow for others like
> >> wdt:P31 or skos:subject via a command-line option that takes any URI as
> >> the type relation, with rdf:type as the default - but that is probably
> >> overkill.
> >>
> >> - There is a bug in the stats script or file, I guess because of some
> >> overflow? The count value is
> >>
> >>   (count -1983667112)
> >>
> >> which indicates this. I opened a ticket:
> >> https://issues.apache.org/jira/browse/JENA-2225
> >>
> >> [1] https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> On 18.12.21 11:35, Marco Neumann wrote:
> >>> Good morning Lorenz,
> >>>
> >>> Maybe it is time for a few query benchmark tests? :)
> >>>
> >>> What does tdb2.tdbstats report?
> >>>
> >>> Marco
> >>>
> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid>
> >>> wrote:
> >>>
> >>>> Good morning,
> >>>>
> >>>> Loading of Wikidata truthy is done; this time I didn't forget to keep
> >>>> the logs:
> >>>> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>
> >>>> I'm a bit surprised that this time it was 8h faster than last time,
> >>>> 31h vs 39h. I'm not sure whether a) there was something else running
> >>>> on the server last time (at least I couldn't see any running tasks) or
> >>>> b) this is a consequence of the more parallelized Unix sort - I set it
> >>>> to --parallel=16.
> >>>>
> >>>> The piped input stream is single-threaded, I guess, but maybe the sort
> >>>> merge step can benefit from more threads? I suppose I have to clean up
> >>>> everything and run it again with the original setup of 2 Unix sort
> >>>> threads ...
> >>>>
> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>
> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>> ...
> >>>>>
> >>>>>> I am getting a slowdown during data ingestion. However, your summary
> >>>>>> figures don't show that in the ingest phase. The whole logs may have
> >>>>>> the signal in them, but less pronounced.
> >>>>>>
> >>>>>> My working assumption is now that it is random access to the node
> >>>>>> table. Your results point to it not being a CPU issue but to my
> >>>>>> setup saturating the I/O path. While the portable has an NVMe SSD,
> >>>>>> it probably does not have the same I/O bandwidth as a server-class
> >>>>>> machine.
> >>>>>>
> >>>>>> I'm not sure what to do about this other than run with a much bigger
> >>>>>> node table cache for the ingestion phase. Substituting some of the
> >>>>>> memory-mapped file area for a bigger cache should be a win. While I
> >>>>>> hadn't noticed it before, it is probably visible in logs of smaller
> >>>>>> loads on closer inspection. Experimenting on a small dataset is a
> >>>>>> lot easier.
> >>>>>
> >>>>> I'm more sure of this - not yet definite.
> >>>>>
> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>> It seems rather small for the task.
> >>>>>
> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> >>>>> SPARQL results.
> >>>>>
> >>>>> Two pieces of data will help:
> >>>>>
> >>>>> Experimenting with very small cache settings.
> >>>>>
> >>>>> Letting my slow load keep going to see whether the same
> >>>>> characteristics appear at the index stage. There shouldn't be if
> >>>>> nodeToNodeId is the cause; it is only an influence in the data
> >>>>> ingestion step.
> >>>>>
> >>>>> Aside: Increasing nodeToNodeId could also help tdbloader=parallel and
> >>>>> maybe loader=phased. It falls into the same situation, although the
> >>>>> improvement there is going to be less marked. "Parallel" saturates
> >>>>> the I/O by other means as well.
> >>>>>
> >>>>>     Andy

-- 
---
Marco Neumann
KONA
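A minimal sketch of the generic "type relation" idea mentioned above: Jena's
ParameterizedSparqlString can substitute any property URI (rdf:type, wdt:P31,
skos:subject, ...) into the per-class count query. The class name and variable
name below are illustrative only; the wiring into the stats script as a
command-line option is not shown.

    import org.apache.jena.query.ParameterizedSparqlString;
    import org.apache.jena.query.Query;

    public class ClassCountQuery {
        public static void main(String[] args) {
            // Illustrative "type relation": wdt:P31 for Wikidata; rdf:type would be the default.
            String typeRelation = "http://www.wikidata.org/prop/direct/P31";

            // Bind the chosen property into the per-class count query.
            ParameterizedSparqlString pss = new ParameterizedSparqlString(
                    "SELECT ?c (COUNT(*) AS ?cnt) WHERE { ?s ?typeRel ?c } GROUP BY ?c");
            pss.setIri("typeRel", typeRelation);

            Query query = pss.asQuery();   // parses with the property substituted
            System.out.println(query);
        }
    }

Running the resulting query and converting each (?c, ?cnt) row into an SSE rule
of the form shown above would produce the extra entries to place before the
generic wdt:P31 rule.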
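On the negative value reported in JENA-2225: a count above Integer.MAX_VALUE
wraps to a negative number when narrowed to a 32-bit int, which is consistent
with "(count -1983667112)". A minimal sketch; the long value below is chosen
only to reproduce that exact figure and is not taken from the dataset.

    public class CountOverflow {
        public static void main(String[] args) {
            // Illustrative value only: a count larger than Integer.MAX_VALUE (2_147_483_647).
            long tripleCount = 6_606_267_480L;

            // Narrowing to int keeps only the low 32 bits, producing a negative number.
            int wrapped = (int) tripleCount;
            System.out.println(wrapped);   // prints -1983667112

            // Keeping the counter as a long (and writing the long out) avoids the wrap.
        }
    }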