Excellent news!
Updated: https://www.w3.org/wiki/LargeTripleStores
Andy
On 28/12/2021 10:11, Marco Neumann wrote:
Ok, here is another successful tdb2 load, this time with the full Wikidata
download file (20211222_latest-all.nt.gz, 172G),
counting 16,733,395,878 triples and taking a total of "103h 45m 15s" for the
entire load.
I think with the right hardware this could easily be compressed quite a bit
in time.
http://www.lotico.com/temp/LOG-45497
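For reference, the invocation for a run like this is roughly the following -
a sketch only, with placeholder paths for the database and temp directories:

   tdb2.xloader --loc /data/wikidata --tmpdir /data/tmp 20211222_latest-all.nt.gz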
Best,
Marco
On Tue, Dec 21, 2021 at 8:59 AM Marco Neumann <marco.neum...@gmail.com> wrote:
Thank you Andy. Found it in the revisions somewhere.
Just finished another run with truthy:
http://lotico.com/temp/LOG-1214
I will now increase RAM before running an additional load with an increased
thread count.
Marco
On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne <a...@apache.org> wrote:
gists are git repos: so the file is there ... somewhere:
https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
Andy
On 19/12/2021 17:56, Marco Neumann wrote:
Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
GitHub.
On Sun, Dec 19, 2021 at 9:46 AM LB <conpcompl...@googlemail.com.invalid> wrote:
I edited the Gist [1] and put the default stats there. Takes ~4min to
compute the stats.
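For reference, producing and installing the default stats is roughly this -
a sketch only, assuming the stock tdb2.tdbstats command and the usual TDB2
Data-0001 storage directory, with placeholder paths:

   tdb2.tdbstats --loc /data/wikidata-truthy > stats.opt
   cp stats.opt /data/wikidata-truthy/Data-0001/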
Findings:
- for Wikidata we have to extend those stats with stats for the wdt:P31
property, as Wikidata uses this property as its own rdf:type relation. It
is indeed trivial: just execute
  select ?c (count(*) as ?cnt) { ?s <http://www.wikidata.org/prop/direct/P31> ?c } group by ?c
convert the result into the stats rule language (SSE), and put those rules
before the more generic rule
  (<http://www.wikidata.org/prop/direct/P31> 98152611)
(see the sketch below, after this list)
- I didn't want to touch the stats script itself, but we could for
example also make this type relation generic and allow for others, like
wdt:P31 or skos:subject, via a command-line option that takes any URI
as the type relation, with rdf:type as the default - but that's probably
overkill
- there is a bug in the stats script or the stats file, I guess caused by
some overflow: the count value is
  (count -1983667112)
which indicates this. I opened a ticket:
https://issues.apache.org/jira/browse/JENA-2225
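A sketch of how I'd generate those per-class rules (placeholder database
path; the rule shape in the comments is my reading of the stats rule format,
with the counts taken from the query result):

   # per-class counts for the Wikidata "instance of" property
   tdb2.tdbquery --loc /data/wikidata-truthy \
     'select ?c (count(*) as ?cnt) { ?s <http://www.wikidata.org/prop/direct/P31> ?c } group by ?c'
   # each result row then becomes a rule of the form
   #   ((<http://www.wikidata.org/prop/direct/P31> <class-IRI>) <count>)
   # placed in stats.opt before the generic rule
   #   (<http://www.wikidata.org/prop/direct/P31> 98152611)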
[1]
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
On 18.12.21 11:35, Marco Neumann wrote:
Good morning Lorenz,
Maybe time to run a few query benchmark tests? :)
What does tdb2.tdbstats report?
Marco
On Sat, Dec 18, 2021 at 8:09 AM LB <conpcompl...@googlemail.com.invalid> wrote:
Good morning,
Loading of Wikidata truthy is done; this time I didn't forget to keep the
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. I'm not sure whether a) there was something else running on the
server last time (at least I couldn't see any running tasks) or b) this is
a consequence of the more parallelized Unix sort now - I set it to
--parallel=16.
I mean, the piped input stream is single-threaded I guess, but maybe the
sort's merge step can benefit from more threads? I guess I have to clean
everything up and run it again with the original setup of 2 Unix sort
threads ...
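To be clear, the parallelism knob here is plain GNU sort. The following is
just an illustration of the relevant sort options, not the exact pipeline
the xloader runs internally - file names and sizes are placeholders:

   zcat latest-truthy.nt.gz | sort --parallel=16 --buffer-size=16G -T /fast/tmp > sorted.nt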
On 16.12.21 14:48, Andy Seaborne wrote:
On 16/12/2021 10:52, Andy Seaborne wrote:
...
I am getting a slowdown during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have
the signal in them, but less pronounced.
My working assumption is now that it is random access to the node
table. Your results point to it not being a CPU issue but to my
setup saturating the I/O path. While the portable has an NVMe SSD,
it has probably not got the same I/O bandwidth as a server-class
machine.
I'm not sure what to do about this other than run with a much bigger
node table cache for the ingestion phase. Trading some memory-mapped
file space for a bigger cache should be a win. While I hadn't
noticed it before, it is probably visible in logs of smaller loads on
closer inspection. Experimenting on a small dataset is a lot easier.
I'm more sure of this - not yet definite.
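One way to check that while an ingest is running - assuming sysstat is
installed - is to watch device utilization and waits against CPU use, e.g.:

   iostat -xm 5    # per-device utilization, throughput and await
   vmstat 5        # overall CPU use vs I/O wait

If the database/tmp devices sit near 100% utilization while the CPUs are
mostly idle or in iowait, that points at the I/O path rather than CPU.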
The nodeToNodeId cache is 200k -- this is on the load/update path. It
seems rather small for the task.
The nodeIdToNode cache is 1e6 -- this is the one that is hit by SPARQL
results.
Two pieces of data will help:
- experimenting with very small cache settings;
- letting my slow load keep going, to see if the same characteristics
show up at the index stage. There shouldn't be, if nodeToNodeId is the
cause; it only has an influence in the data ingestion step.
Aside: increasing nodeToNodeId could also help loader=parallel and
maybe loader=phased. They fall into the same situation, although the
improvement there is going to be less marked. "Parallel" saturates the
I/O by other means as well.
Andy
--
---
Marco Neumann
KONA