I'm debugging now and think I've found some possible culprits for the
slow data ingest stage on my setup.

In the data ingest stage, I see 100% disk read utilization at only 2.5 MB/s,
and the ps command also shows the processor spending time waiting for I/O.
In the jena-base/src/main/java/org/apache/jena/atlas/io/IO.java class, both
GZIPInputStream and GZIPOutputStream are constructed with a single argument,
which defaults to a 512-byte buffer. (
https://github.com/openjdk/jdk/blob/6e1da6440b898d8a10b3db56707a75f32a489149/src/java.base/share/classes/java/util/zip/GZIPInputStream.java#L90
)
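
For reference, the linked JDK source shows that GZIPInputStream's one-argument
constructor just delegates to the explicit-buffer form with 512 as the size,
and GZIPOutputStream does the same; roughly:

    // Paraphrased from the JDK source linked above (java.util.zip):
    public GZIPInputStream(InputStream in) throws IOException {
        this(in, 512);           // 512-byte internal inflater buffer
    }

    public GZIPOutputStream(OutputStream out) throws IOException {
        this(out, 512, false);   // 512-byte internal deflater buffer
    }

So every caller that omits the buffer size ends up reading or writing the
compressed stream through a 512-byte buffer.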

I've now built a snapshot with "return new
GZIP[Output/Input]Stream(out,8*1024);" (the buffer value used for
GZIPOutput/InputStream elsewhere in Jena), which is currently running. I'm
also thinking of testing larger buffers to find my "lowest best buffer". I'm
now waiting for the first steps to complete, to see the effect of the change.
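
In case it's useful, here is a minimal sketch of the shape of the change I'm
testing; the class and method names below are only illustrative, not the
actual helpers in IO.java:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class GzipBufferPatch {
        // Buffer size under test; the JDK default is 512 bytes when the
        // single-argument constructors are used.
        private static final int GZIP_BUFFER = 8 * 1024;

        // Illustrative wrappers - the real change is in IO.java's gzip helpers.
        static InputStream gzipInput(InputStream in) throws IOException {
            return new GZIPInputStream(in, GZIP_BUFFER);
        }

        static OutputStream gzipOutput(OutputStream out) throws IOException {
            return new GZIPOutputStream(out, GZIP_BUFFER);
        }
    }

Testing larger values (e.g. 64 * 1024) would then just be a matter of changing
that one constant.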

What do I need to do to start from the exec_java $PKG.CmdxIngestData step
in tdb2.xloader?

Would
* stopping xloader once before this step,
* taking a copy of $LOC, and
* restoring that copy to $LOC before each run (with the preceding steps
commented out in bin/tdb2.xloader)

be enough for a working test setup that starts at "data ingest", or do I
need to do something else?

Best regards,
Øyvind



On Tue, 28 Dec 2021 at 14:21, Andy Seaborne <a...@apache.org> wrote:

> Excellent news!
>
> Updated: https://www.w3.org/wiki/LargeTripleStores
>
>      Andy
>
> On 28/12/2021 10:11, Marco Neumann wrote:
> > OK, here is another successful tdb2 load, this time with the full wikidata
> > download (20211222_latest-all.nt.gz, 172G) file.
> >
> > counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
> > entire load.
> >
> > I think with the right hardware this could easily be time compressed
> quite
> > a bit.
> >
> > http://www.lotico.com/temp/LOG-45497
> >
> > Best,
> > Marco
> >
> >
> > On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <marco.neum...@gmail.com>
> > wrote:
> >
> >> Thank you Andy. Found it in the revisions somewhere.
> >>
> >> just finished another run with truthy
> >>
> >> http://lotico.com/temp/LOG-1214
> >>
> >> will now increase RAM before running an additional load with increased
> >> thread count.
> >>
> >> Marco
> >>
> >> On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <a...@apache.org> wrote:
> >>
> >>> gists are git repos: so the file is there ... somewhere:
> >>>
> >>>
> >>>
> https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
> >>>
> >>>       Andy
> >>>
> >>> On 19/12/2021 17:56, Marco Neumann wrote:
> >>>> Thank you Lorenz,
> >>>> unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in
> >>>> github
> >>>>
> >>>>
> >>>> On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com
> >>> .invalid>
> >>>> wrote:
> >>>>
> >>>>> I edited the Gist [1] and put the default stats there. Takes ~4min to
> >>>>> compute the stats.
> >>>>>
> >>>>> Findings:
> >>>>>
> >>>>> - for Wikidata we have to extend those stats with stats for the
> >>>>> wdt:P31 property, as Wikidata uses this property as its own rdf:type
> >>>>> relation. It is indeed trivial, just execute
> >>>>>
> >>>>> select ?c (count(*) as ?cnt) {?s
> >>>>> <http://www.wikidata.org/prop/direct/P31> ?c} group by ?c
> >>>>>
> >>>>> and convert it into the stats rule language (SSE) and put those rules
> >>>>> before the more generic rule
> >>>>>
> >>>>> |(<http://www.wikidata.org/prop/direct/P31> 98152611)|
> >>>>>
> >>>>> - I didn't want to touch the stats script itself, but we could, for
> >>>>> example, also make this type relation generic and allow for others like
> >>>>> wdt:P31 or skos:subject via a command-line option that provides any URI
> >>>>> as the type relation, with the default being rdf:type - but that's
> >>>>> probably overkill
> >>>>>
> >>>>> - there is a bug in the stats script or file, I guess, because of some
> >>>>> overflow? The count value is
> >>>>>
> >>>>> (count -1983667112))
> >>>>>
> >>>>> which indicates this.  I opened a ticket:
> >>>>> https://issues.apache.org/jira/browse/JENA-2225
> >>>>>
> >>>>>
> >>>>> [1]
> >>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>>
> >>>>> On 18.12.21 11:35, Marco Neumann wrote:
> >>>>>> good morning Lorenz,
> >>>>>>
> >>>>>> Maybe time to get a few query benchmark tests? :)
> >>>>>>
> >>>>>> What does tdb2.tdbstats report?
> >>>>>>
> >>>>>> Marco
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com
> >>> .invalid>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Good morning,
> >>>>>>>
> >>>>>>> loading of Wikidata truthy is done, this time I didn't forget to
> keep
> >>>>>>> logs:
> >>>>>>>
> >>>
> https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
> >>>>>>>
> >>>>>>> I'm a bit surprised that this time it was 8h faster than last time,
> >>> 31h
> >>>>>>> vs 39h. Not sure if a) there was something else on the server last
> >>> time
> >>>>>>> (at least I couldn't see any running tasks) or b) if this is a
> >>>>>>> consequence of the more parallelized Unix sort now - I set it to
> >>>>>>> --parallel=16
> >>>>>>>
> >>>>>>> I mean, the piped input stream is single threaded I guess, but
> maybe
> >>> the
> >>>>>>> sort merge step can benefit from more threads? I guess I have to
> >>> clean
> >>>>>>> up everything and run it again with the original setup with 2 Unix
> >>> sort
> >>>>>>> threads ...
> >>>>>>>
> >>>>>>>
> >>>>>>> On 16.12.21 14:48, Andy Seaborne wrote:
> >>>>>>>>
> >>>>>>>> On 16/12/2021 10:52, Andy Seaborne wrote:
> >>>>>>>> ...
> >>>>>>>>
> >>>>>>>>> I am getting a slow down during data ingestion. However, your
> >>> summary
> >>>>>>>>> figures don't show that in the ingest phase. The whole logs may
> >>> have
> >>>>>>>>> the signal in it but less pronounced.
> >>>>>>>>>
> >>>>>>>>> My working assumption is now that it is random access to the node
> >>>>>>>>> table. Your results point to it not being a CPU issue but that my
> >>>>>>>>> setup is saturating the I/O path. While the portable has a NVMe
> >>> SSD,
> >>>>>>>>> it has probably not got the same I/O bandwidth as a server class
> >>>>>>>>> machine.
> >>>>>>>>>
> >>>>>>>>> I'm not sure what to do about this other than run with a much
> >>> bigger
> >>>>>>>>> node table cache for the ingestion phase. Substituting some file
> >>>>>>>>> mapper file area for bigger cache should be a win. While I hadn't
> >>>>>>>>> noticed before, it is probably visible in logs of smaller loads
> on
> >>>>>>>>> closer inspection. Experimenting on a small dataset is a lot
> >>> easier.
> >>>>>>>> I'm more sure of this - not yet definite.
> >>>>>>>>
> >>>>>>>> The nodeToNodeId cache is 200k -- this is on the load/update path.
> >>>>>>>> Seems rather small for the task.
> >>>>>>>>
> >>>>>>>> The nodeIdToNode cache is 1e6 -- this is the one that is hit by
> >>> SPARQL
> >>>>>>>> results.
> >>>>>>>>
> >>>>>>>> 2 pieces of data will help:
> >>>>>>>>
> >>>>>>>> Experimenting with very small cache settings.
> >>>>>>>>
> >>>>>>>> Letting my slow load keep going to see if there is the same
> >>>>>>>> characteristics at the index stage. There shouldn't be if
> >>> nodeToNodeId
> >>>>>>>> is the cause; it's only an influence in the data ingestion step.
> >>>>>>>>
> >>>>>>>> Aside : Increasing nodeToNodeId could also help tdbloader=parallel
> >>> and
> >>>>>>>> maybe loader=phased. It falls into the same situation although the
> >>>>>>>> improvement there is going to be less marked. "Parallel" saturates
> >>> the
> >>>>>>>> I/O by other means as well.
> >>>>>>>>
> >>>>>>>>        Andy
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >>
> >>
> >> ---
> >> Marco Neumann
> >> KONA
> >>
> >>
> >
>