[ https://issues.apache.org/jira/browse/JENA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Seaborne closed JENA-2314.
-------------------------------
> tdb2.tdbloader performance issue
> --------------------------------
>
> Key: JENA-2314
> URL: https://issues.apache.org/jira/browse/JENA-2314
> Project: Apache Jena
> Issue Type: Question
> Components: TDB2
> Affects Versions: Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
> Environment: Java maximum memory: 12884901888
> symbol:http://jena.apache.org/ARQ#regexImpl = symbol:http://jena.apache.org/ARQ#javaRegex
> symbol:http://jena.apache.org/ARQ#registryFunctions = org.apache.jena.sparql.function.FunctionRegistry@1536602f
> symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
> symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
> symbol:http://jena.apache.org/ARQ#stageGenerator = org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
> symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
> symbol:http://jena.apache.org/ARQ#strictSPARQL = false
> 13:02:36 INFO loader :: Loader = LoaderParallel
> 13:02:36 INFO loader :: Start: 6 files
> 13:02:48 INFO loader :: Add: 500,000 bdmhistoricalrecords.nq (Batch: 40,361 / Avg: 40,361)
> 13:03:00 INFO loader :: Add: 1,000,000 bdmhistoricalrecords.nq (Batch: 44,907 / Avg: 42,513)
> 13:03:10 INFO loader :: Add: 1,500,000 bdmhistoricalrecords.nq (Batch: 47,980 / Avg: 44,191)
> 13:03:25 INFO loader :: Add: 2,000,000 bdmhistoricalrecords.nq (Batch: 32,486 / Avg: 40,539)
> 13:33:06 INFO loader :: Add: 2,500,000 bdmhistoricalrecords.nq (Batch: 280 / Avg: 1,366)
> 14:30:30 INFO loader :: Add: 3,000,000 bdmhistoricalrecords.nq (Batch: 145 / Avg: 568)
> 14:52:29 INFO loader :: Add: 3,500,000 bdmhistoricalrecords.nq (Batch: 378 / Avg: 530)
> Reporter: R Pope
> Priority: Major
>
> Kia ora, Hi there,
> We have been using tdb2.tdbloader to load ~400,000,000 triples into our
> triplestore. All the data is in N-Quads (.nq) format, having previously been
> converted from JSON-LD. The files we are loading range from ~10GB to ~50GB,
> producing a triplestore of ~180GB including a text index. We run the loader
> in an HPC environment, so we can request as much memory as we need, often
> using 1TB for the load. The job runs in a Singularity image (similar to
> Docker), and Slurm is the workload manager.
> All that aside, the load typically takes ~12-16 hours, and never more than
> 24 hours, with --loader=parallel and an average rate of ~5,000 triples per
> second. We haven't needed to run the loader since October 2021, and on
> running the load job again recently we are getting an overall average of
> ~500 triples per second. We haven't been able to wait and see whether it
> even finishes.
> Has anyone else experienced such a big performance loss with tdb2.tdbloader
> in the current or recent versions of Jena? Apart from the investigation
> that can be done on the Slurm/HPC side, does anyone have advice on
> performance?
> Thanks in advance
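
To put the two reported rates in perspective, here is a minimal sketch of the arithmetic, assuming the loader holds a constant rate for the whole run (the log above shows it does not) and using the approximate figures quoted in the report. The function name hours_to_load is hypothetical and only for illustration.

```python
# Rough load-time estimate from the approximate figures in the report.
# Assumption: a constant rate over the whole load, which is a simplification.

TOTAL_TRIPLES = 400_000_000  # "~400,000,000 triples" quoted in the report


def hours_to_load(total_triples: int, triples_per_second: float) -> float:
    """Return the estimated wall-clock hours to load at a constant rate."""
    return total_triples / triples_per_second / 3600


if __name__ == "__main__":
    rates = [
        ("earlier runs (~5,000 triples/s)", 5_000),
        ("current run (~500 triples/s)", 500),
    ]
    for label, rate in rates:
        hours = hours_to_load(TOTAL_TRIPLES, rate)
        print(f"{label}: {hours:.1f} h ({hours / 24:.1f} days)")
```

At ~5,000 triples per second this gives about 22 hours, inside the stated 24-hour upper bound. At ~500 triples per second it gives about 222 hours, roughly nine days.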