Thank you Andy, found it in the revisions somewhere.

Just finished another run with truthy:

http://lotico.com/temp/LOG-1214

I will now increase the RAM before running an additional load with an
increased thread count.

Marco

On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <a...@apache.org> wrote:

> gists are git repos: so the file is there ... somewhere:
>
>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
>
>      Andy
>
> On 19/12/2021 17:56, Marco Neumann wrote:
> > Thank you Lorenz,
> > unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
> > GitHub.
> >
> >
> > On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com.invalid>
> > wrote:
> >
> >> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >> compute the stats.
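> >>
> >> For anyone wanting to reproduce this, the generation step is roughly as
> >> follows (a sketch - the database path is a placeholder, and I'm assuming
> >> the usual --loc form of the command):
> >>
> >>     # compute statistics over the TDB2 database and write them to stats.opt
> >>     tdb2.tdbstats --loc /path/to/wikidata-truthy-db > stats.opt
> >>     # place the file where the optimizer looks for it
> >>     # (for TDB2 I believe that is the active Data-NNNN directory)
> >>     cp stats.opt /path/to/wikidata-truthy-db/Data-0001/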
> >>
> >> Findings:
> >>
> >> - for Wikidata we have to extend those stats with stats for the wdt:P31
> >> property, as Wikidata uses this property as its own rdf:type relation.
> >> It is indeed trivial, just execute
> >>
> >> select ?c (count(*) as ?cnt)
> >> { ?s <http://www.wikidata.org/prop/direct/P31> ?c }
> >> group by ?c
> >>
> >> and convert the result into the stats rule language (SSE), putting those
> >> rules before the more generic rule
> >>
> >> (<http://www.wikidata.org/prop/direct/P31> 98152611)
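> >>
> >> To make the shape concrete, the added entries would look something like
> >> this in the stats file (the per-class counts below are invented, only
> >> the form matters; the real numbers come from the query above):
> >>
> >>     ((<http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5>) 9000000)
> >>     ((<http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q13442814>) 35000000)
> >>     (<http://www.wikidata.org/prop/direct/P31> 98152611)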
> >>
> >> - I didn't want to touch the stats script itself, but we could for
> >> example make this type relation generic and allow for others like
> >> wdt:P31 or skos:subject via a command-line option that takes any URI as
> >> the type relation, with rdf:type as the default (sketched below) - but
> >> that is probably overkill
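> >>
> >> Purely as an illustration, such an option might look like this - note
> >> that --type-property does not exist today, it's a hypothetical flag:
> >>
> >>     # hypothetical invocation, not a real option:
> >>     tdb2.tdbstats --loc /path/to/wikidata-truthy-db \
> >>         --type-property http://www.wikidata.org/prop/direct/P31 > stats.opt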
> >>
> >> - there is a bug in the stats script or file, I guess because of some
> >> overflow? The count value is
> >>
> >> (count -1983667112))
> >>
> >> which indicates this. I opened a ticket:
> >> https://issues.apache.org/jira/browse/JENA-2225
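> >>
> >> As a quick sanity check on the overflow theory (just the 32-bit wrap
> >> arithmetic - the true total is whatever the run actually counted):
> >>
> >>     -1983667112 + 2^32 = 2311300184
> >>     2311300184 + 2^32  = 6606267480
> >>
> >> so any total of the form 2311300184 + k * 2^32 comes out as -1983667112
> >> when accumulated in a signed 32-bit int, and a multi-billion-triple
> >> dataset is well past the 2^31 - 1 limit.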
> >>
> >>
> >> [1]
> >> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>
> >> On 18.12.21 11:35, Marco Neumann wrote:
> >>> Good morning Lorenz,
> >>>
> >>> Maybe time to run a few query benchmark tests? :)
> >>>
> >>> What does tdb2.tdbstats report?
> >>>
> >>> Marco
> >>>
> >>>
> >>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid>
> >>> wrote:
> >>>
> >>>> Good morning,
> >>>>
> >>>> loading of Wikidata truthy is done, this time I didn't forget to keep
> >>>> logs:
> >>>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>
> >>>> I'm a bit surprised that this time it was 8h faster than last time,
> >>>> 31h vs 39h. Not sure if a) there was something else on the server last
> >>>> time (at least I couldn't see any running tasks) or b) if this is a
> >>>> consequence of the more parallelized Unix sort now - I set it to
> >>>> --parallel=16
> >>>>
> >>>> I mean, the piped input stream is single threaded I guess, but maybe
> >>>> the sort merge step can benefit from more threads? I guess I have to
> >>>> clean up everything and run it again with the original setup with 2
> >>>> Unix sort threads ...
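> >>>>
> >>>> For context, the kind of pipeline this boils down to is roughly the
> >>>> following (a sketch, not the actual xloader commands - file names and
> >>>> options other than --parallel are made up):
> >>>>
> >>>>     # single-threaded producer feeding a multi-threaded external sort
> >>>>     gzip -dc tmp.triples.gz | \
> >>>>         sort --parallel=16 --buffer-size=16G -T /data/tmp -u \
> >>>>         > tmp.triples.sorted
> >>>>
> >>>> --parallel mainly affects sort's own in-memory sorting of chunks; the
> >>>> decompress/parse side of the pipe stays single threaded either way.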
> >>>>
> >>>>
> >>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>
> >>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>> ...
> >>>>>
> >>>>>> I am getting a slowdown during data ingestion. However, your summary
> >>>>>> figures don't show that in the ingest phase. The whole logs may have
> >>>>>> the signal in them, but less pronounced.
> >>>>>>
> >>>>>> My working assumption is now that it is random access to the node
> >>>>>> table. Your results point to it not being a CPU issue but that my
> >>>>>> setup is saturating the I/O path. While the portable has an NVMe SSD,
> >>>>>> it has probably not got the same I/O bandwidth as a server-class
> >>>>>> machine.
> >>>>>>
> >>>>>> I'm not sure what to do about this other than run with a much bigger
> >>>>>> node table cache for the ingestion phase. Substituting some
> >>>>>> memory-mapped file area for a bigger cache should be a win. While I
> >>>>>> hadn't noticed before, it is probably visible in logs of smaller
> >>>>>> loads on closer inspection. Experimenting on a small dataset is a lot
> >>>>>> easier.
> >>>>> I'm more sure of this - not yet definite.
> >>>>>
> >>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>> Seems rather small for the task.
> >>>>>
> >>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> >>>>> SPARQL results.
> >>>>>
> >>>>> Two pieces of data will help:
> >>>>>
> >>>>> Experimenting with very small cache settings.
> >>>>>
> >>>>> Letting my slow load keep going to see if the same characteristics
> >>>>> show up at the index stage. There shouldn't be, if nodeToNodeId is
> >>>>> the cause; it's only an influence in the data ingestion step.
> >>>>>
> >>>>> Aside: Increasing nodeToNodeId could also help tdbloader=parallel and
> >>>>> maybe loader=phased. It falls into the same situation, although the
> >>>>> improvement there is going to be less marked. "Parallel" saturates the
> >>>>> I/O by other means as well.
> >>>>>
> >>>>>       Andy
> >>>
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA
