Thank you, Lorenz. Unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on GitHub.
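Side note on the per-class P31 stats rules Lorenz describes below: the counts could
also be pulled straight from the TDB2 store with Jena's Java API and printed as
candidate rules. A minimal sketch, assuming a placeholder database path and my
reading of the full-pattern rule form ((VAR p o) count) from the TDB optimizer
docs - worth double-checking against the generated stats.opt before pasting the
lines in front of the generic P31 rule:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class P31StatsRules {
    public static void main(String[] args) {
        // Placeholder location of the TDB2 database built by tdb2.xloader.
        String location = args.length > 0 ? args[0] : "/data/wikidata-truthy-tdb2";
        String query = "SELECT ?c (COUNT(*) AS ?cnt) "
                + "WHERE { ?s <http://www.wikidata.org/prop/direct/P31> ?c } GROUP BY ?c";
        Dataset ds = TDB2Factory.connectDataset(location);
        // TDB2 needs a transaction even for read-only queries.
        Txn.executeRead(ds, () -> {
            try (QueryExecution qe = QueryExecutionFactory.create(query, ds)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    QuerySolution row = rs.next();
                    // One candidate rule per class; the ((VAR p o) count) pattern is my
                    // reading of the stats rule language and should be verified against
                    // the TDB optimizer documentation before use.
                    System.out.printf("((VAR <http://www.wikidata.org/prop/direct/P31> <%s>) %d)%n",
                            row.getResource("c").getURI(),
                            row.getLiteral("cnt").getLong());
                }
            }
        });
        ds.close();
    }
}

Run with the database directory as the only argument; it prints one rule per class,
to be merged into the stats file by hand ahead of the generic P31 rule.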
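And on the negative (count -1983667112) below: that looks exactly like a 32-bit
overflow. If the true triple count were around 6.6 billion (an assumption on my
side, not a figure taken from the logs), narrowing it to a signed int reproduces
that value - a quick check:

public class CountOverflowCheck {
    public static void main(String[] args) {
        // Hypothetical true count of ~6.6 billion triples - an assumption,
        // not a number read from the load logs.
        long assumedCount = 6_606_267_480L;
        // Narrowing to a signed 32-bit int wraps around and reproduces the
        // negative value seen in the stats file.
        int truncated = (int) assumedCount;
        System.out.println(truncated);   // prints -1983667112
    }
}

which would fit the overflow theory reported in JENA-2225.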
On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com.invalid> wrote:
> I edited the Gist [1] and put the default stats there. Takes ~4min to
> compute the stats.
>
> Findings:
>
> - for Wikidata we have to extend those stats with the stats for the
> wdt:P31 property, as Wikidata uses this property as its own rdf:type
> relation. It is indeed trivial, just execute
>
> select ?c (count(*) as ?cnt) {?s
> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
>
> and convert it into the stats rule language (SSE) and put those rules
> before the more generic rule
>
> (<http://www.wikidata.org/prop/direct/P31> 98152611)
>
> - I didn't want to touch the stats script itself, but we could for
> example also make this type relation generic and allow for others like
> wdt:P31 or skos:subject via a command-line option which provides any URI
> as the type relation, with the default being rdf:type - but that's
> probably overkill
>
> - there is a bug in the stats script or file, I guess because of some
> overflow? The count value is
>
> (count -1983667112)
>
> which indicates this. I opened a ticket:
> https://issues.apache.org/jira/browse/JENA-2225
>
>
> [1]
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
>
> On 18.12.21 11:35, Marco Neumann wrote:
> > good morning Lorenz,
> >
> > Maybe time to get a few query benchmark tests? :)
> >
> > What does tdb2.tdbstats report?
> >
> > Marco
> >
> >
> > On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> Good morning,
> >>
> >> loading of Wikidata truthy is done, this time I didn't forget to keep
> >> logs:
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> I'm a bit surprised that this time it was 8h faster than last time, 31h
> >> vs 39h. Not sure if a) there was something else on the server last time
> >> (at least I couldn't see any running tasks) or b) if this is a
> >> consequence of the more parallelized Unix sort now - I set it to
> >> --parallel=16
> >>
> >> I mean, the piped input stream is single-threaded I guess, but maybe the
> >> sort merge step can benefit from more threads? I guess I have to clean
> >> up everything and run it again with the original setup with 2 Unix sort
> >> threads ...
> >>
> >>
> >> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>
> >>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>> ...
> >>>
> >>>> I am getting a slowdown during data ingestion. However, your summary
> >>>> figures don't show that in the ingest phase. The whole logs may have
> >>>> the signal in them, but less pronounced.
> >>>>
> >>>> My working assumption is now that it is random access to the node
> >>>> table. Your results point to it not being a CPU issue but that my
> >>>> setup is saturating the I/O path. While the portable has an NVMe SSD,
> >>>> it has probably not got the same I/O bandwidth as a server-class
> >>>> machine.
> >>>>
> >>>> I'm not sure what to do about this other than run with a much bigger
> >>>> node table cache for the ingestion phase. Substituting some of the
> >>>> memory-mapped file area for a bigger cache should be a win. While I
> >>>> hadn't noticed it before, it is probably visible in logs of smaller
> >>>> loads on closer inspection. Experimenting on a small dataset is a lot
> >>>> easier.
> >>> I'm more sure of this - not yet definite.
> >>>
> >>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>> Seems rather small for the task.
> >>>
> >>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
> >>> results.
> >>>
> >>> 2 pieces of data will help:
> >>>
> >>> Experimenting with very small cache settings.
> >>>
> >>> Letting my slow load keep going to see if there are the same
> >>> characteristics at the index stage. There shouldn't be if nodeToNodeId
> >>> is the cause; it's only an influence in the data ingestion step.
> >>>
> >>> Aside: Increasing nodeToNodeId could also help tdbloader=parallel and
> >>> maybe loader=phased. It falls into the same situation, although the
> >>> improvement there is going to be less marked. "Parallel" saturates the
> >>> I/O by other means as well.
> >>>
> >>> Andy
>
>

--
---
Marco Neumann
KONA