GitHub user maxx-ukoo added a comment to the discussion: How to load big dataset to new database
Currently, I have a database with 35 files loaded. The database is a little over 400 GB in size and contains 395,085,113 triples. The parallel loader runs very slowly against this database:

```
2026-01-19T16:37:26+00:00
Picked up JAVA_TOOL_OPTIONS: -Xmx8g -Xms8g
Java maximum memory: 8589934592
symbol:http://jena.apache.org/ARQ#regexImpl = symbol:http://jena.apache.org/ARQ#javaRegex
symbol:http://jena.apache.org/ARQ#registryFunctions = org.apache.jena.sparql.function.FunctionRegistry@60641ec8
symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@75f65e45
symbol:http://jena.apache.org/ARQ#stageGenerator = org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@6eeade6c
symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
symbol:http://jena.apache.org/ARQ#registryServiceExecutors = org.apache.jena.sparql.service.ServiceExecutorRegistry@4a891c97
symbol:http://jena.apache.org/ARQ#strictSPARQL = false
16:37:26 INFO loader :: Loader = LoaderParallel
16:37:26 INFO loader :: Start: /data/source/compound/general/pc_compound2component.ttl.gz
16:37:29 INFO loader :: Add: 1,000,000 pc_compound2component.ttl.gz (Batch: 344,234 / Avg: 344,234)
16:47:34 INFO loader :: Add: 2,000,000 pc_compound2component.ttl.gz (Batch: 1,652 / Avg: 3,288)
17:00:16 INFO loader :: Add: 3,000,000 pc_compound2component.ttl.gz (Batch: 1,313 / Avg: 2,190)
17:24:05 INFO loader :: Add: 4,000,000 pc_compound2component.ttl.gz (Batch: 699 / Avg: 1,429)
18:21:36 INFO loader :: Add: 5,000,000 pc_compound2component.ttl.gz (Batch: 289 / Avg: 800)
19:21:23 INFO loader :: Add: 6,000,000 pc_compound2component.ttl.gz (Batch: 278 / Avg: 609)
20:31:19 INFO loader :: Add: 7,000,000 pc_compound2component.ttl.gz (Batch: 238 / Avg: 498)
21:59:10 INFO loader :: Add: 8,000,000 pc_compound2component.ttl.gz (Batch: 189 / Avg: 414)
23:34:39 INFO loader :: Add: 9,000,000 pc_compound2component.ttl.gz (Batch: 174 / Avg: 359)
02:07:46 INFO loader :: Add: 10,000,000 pc_compound2component.ttl.gz (Batch: 108 / Avg: 292)
02:07:46 INFO loader ::   Elapsed: 34,219.43 seconds [2026/01/20 02:07:46 UTC]
05:36:22 INFO loader :: Add: 11,000,000 pc_compound2component.ttl.gz (Batch: 79 / Avg: 235)
^C
```

However, when I run the loader against a clean (empty) database folder on the same hardware, it is much faster:

```
Picked up JAVA_TOOL_OPTIONS: -Xmx8g -Xms8g
Java maximum memory: 8589934592
symbol:http://jena.apache.org/ARQ#regexImpl = symbol:http://jena.apache.org/ARQ#javaRegex
symbol:http://jena.apache.org/ARQ#registryFunctions = org.apache.jena.sparql.function.FunctionRegistry@60641ec8
symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@75f65e45
symbol:http://jena.apache.org/ARQ#stageGenerator = org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@6eeade6c
symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
symbol:http://jena.apache.org/ARQ#registryServiceExecutors = org.apache.jena.sparql.service.ServiceExecutorRegistry@4a891c97
symbol:http://jena.apache.org/ARQ#strictSPARQL = false
07:54:19 INFO loader :: Loader = LoaderParallel
07:54:19 INFO loader :: Start: /data/source/compound/general/pc_compound2component.ttl.gz
07:54:22 INFO loader :: Add: 1,000,000 pc_compound2component.ttl.gz (Batch: 339,097 / Avg: 339,097)
07:54:28 INFO loader :: Add: 2,000,000 pc_compound2component.ttl.gz (Batch: 171,438 / Avg: 227,738)
07:54:38 INFO loader :: Add: 3,000,000 pc_compound2component.ttl.gz (Batch: 104,657 / Avg: 163,603)
07:54:48 INFO loader :: Add: 4,000,000 pc_compound2component.ttl.gz (Batch: 101,719 / Avg: 142,005)
07:54:58 INFO loader :: Add: 5,000,000 pc_compound2component.ttl.gz (Batch: 91,996 / Avg: 128,080)
07:55:10 INFO loader :: Add: 6,000,000 pc_compound2component.ttl.gz (Batch: 85,251 / Avg: 118,184)
07:55:22 INFO loader :: Add: 7,000,000 pc_compound2component.ttl.gz (Batch: 84,709 / Avg: 111,869)
07:55:32 INFO loader :: Add: 8,000,000 pc_compound2component.ttl.gz (Batch: 98,911 / Avg: 110,067)
07:55:43 INFO loader :: Add: 9,000,000 pc_compound2component.ttl.gz (Batch: 90,009 / Avg: 107,407)
07:55:55 INFO loader :: Add: 10,000,000 pc_compound2component.ttl.gz (Batch: 86,550 / Avg: 104,880)
07:55:55 INFO loader ::   Elapsed: 95.35 seconds [2026/01/20 07:55:55 UTC]
07:56:05 INFO loader :: Add: 11,000,000 pc_compound2component.ttl.gz (Batch: 96,413 / Avg: 104,049)
07:56:15 INFO loader :: Add: 12,000,000 pc_compound2component.ttl.gz (Batch: 103,241 / Avg: 103,981)
07:56:24 INFO loader :: Add: 13,000,000 pc_compound2component.ttl.gz (Batch: 109,301 / Avg: 104,372)
07:56:33 INFO loader :: Add: 14,000,000 pc_compound2component.ttl.gz (Batch: 110,132 / Avg: 104,763)
07:56:41 INFO loader :: Add: 15,000,000 pc_compound2component.ttl.gz (Batch: 118,245 / Avg: 105,566)
07:56:50 INFO loader :: Add: 16,000,000 pc_compound2component.ttl.gz (Batch: 116,604 / Avg: 106,194)
07:56:58 INFO loader :: Add: 17,000,000 pc_compound2component.ttl.gz (Batch: 122,070 / Avg: 107,013)
07:57:07 INFO loader :: Add: 18,000,000 pc_compound2component.ttl.gz (Batch: 120,279 / Avg: 107,672)
07:57:15 INFO loader :: Add: 19,000,000 pc_compound2component.ttl.gz (Batch: 123,046 / Avg: 108,385)
07:57:23 INFO loader :: Add: 20,000,000 pc_compound2component.ttl.gz (Batch: 120,729 / Avg: 108,942)
07:57:23 INFO loader ::   Elapsed: 183.58 seconds [2026/01/20 07:57:23 UTC]
07:57:31 INFO loader :: Add: 21,000,000 pc_compound2component.ttl.gz (Batch: 118,891 / Avg: 109,378)
07:57:39 INFO loader :: Add: 22,000,000 pc_compound2component.ttl.gz (Batch: 123,777 / Avg: 109,959)
07:57:48 INFO loader :: Add: 23,000,000 pc_compound2component.ttl.gz (Batch: 122,204 / Avg: 110,440)
07:57:55 INFO loader :: Add: 24,000,000 pc_compound2component.ttl.gz (Batch: 130,633 / Avg: 111,156)
07:58:05 INFO loader :: Add: 25,000,000 pc_compound2component.ttl.gz (Batch: 108,201 / Avg: 111,035)
07:58:13 INFO loader :: Add: 26,000,000 pc_compound2component.ttl.gz (Batch: 113,778 / Avg: 111,138)
07:58:23 INFO loader :: Add: 27,000,000 pc_compound2component.ttl.gz (Batch: 107,781 / Avg: 111,010)
07:58:29 INFO loader :: Finished: /data/source/compound/general/pc_compound2component.ttl.gz: 27,762,031 tuples in 249.16s (Avg: 111,421)
07:58:44 INFO loader :: Finish - index SPOG
07:58:44 INFO loader :: Finish - index GSPO
07:58:53 INFO loader :: Finish - index GOSP
07:58:54 INFO loader :: Finish - index GPOS
07:58:54 INFO loader :: Finish - index POSG
07:58:54 INFO loader :: Finish - index OSPG
07:58:54 INFO loader :: Time = 274.513 seconds : Quads = 27,762,031 : Rate = 101,132 /s
2026-01-20T07:58:55+00:00
```

The performance degradation becomes noticeable once the database reaches around 300–400 GB. In that case I see high disk usage:

<img width="3118" height="216" alt="image" src="https://github.com/user-attachments/assets/6cd0798f-5583-4725-926e-4d068473e3d3" />

GitHub link: https://github.com/apache/jena/discussions/3701#discussioncomment-15548009
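
To put a number on the difference between the two runs, the loader's `Add:` progress lines can be parsed and compared. The following is only a minimal sketch: the regular expression assumes the progress-line layout seen in the logs above, the helper name `parse_progress` is made up for illustration, and only two lines from each run are pasted in as sample input.

```python
import re

# Matches loader progress lines of the form seen in the logs above, e.g.
#   16:47:34 INFO loader :: Add: 2,000,000 file.ttl.gz (Batch: 1,652 / Avg: 3,288)
PROGRESS = re.compile(r"Add: ([\d,]+) \S+ \(Batch: ([\d,]+) / Avg: ([\d,]+)\)")


def parse_progress(log_text):
    """Return (triples_added, batch_rate, avg_rate) tuples from loader output."""
    rows = []
    for match in PROGRESS.finditer(log_text):
        rows.append(tuple(int(g.replace(",", "")) for g in match.groups()))
    return rows


# Two progress lines copied from the run against the populated database.
populated = """
16:37:29 INFO loader :: Add: 1,000,000 pc_compound2component.ttl.gz (Batch: 344,234 / Avg: 344,234)
02:07:46 INFO loader :: Add: 10,000,000 pc_compound2component.ttl.gz (Batch: 108 / Avg: 292)
"""

# The same two checkpoints from the run against the empty database.
empty = """
07:54:22 INFO loader :: Add: 1,000,000 pc_compound2component.ttl.gz (Batch: 339,097 / Avg: 339,097)
07:55:55 INFO loader :: Add: 10,000,000 pc_compound2component.ttl.gz (Batch: 86,550 / Avg: 104,880)
"""

slow = parse_progress(populated)[-1]
fast = parse_progress(empty)[-1]

# Ratio of the average rates at the 10,000,000-triple checkpoint.
slowdown = fast[2] / slow[2]
print(f"Average rate at 10M triples: {slow[2]} vs {fast[2]} per second")
print(f"Slowdown factor: {slowdown:.0f}x")
```

At the 10,000,000-triple checkpoint this gives a slowdown of roughly 359x, which agrees with the elapsed times reported in the logs (34,219.43 seconds versus 95.35 seconds).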
