GitHub user maxx-ukoo created a discussion: How to load big dataset to new database
I am going to load the PubChem dataset (https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/, https://pubchem.ncbi.nlm.nih.gov/docs/rdf-load). I start Fuseki, stop it, and then try to upload the data into the dataset directory using the tdb2.tdbloader utility.

I have a few questions:

- What is the correct and fastest way to load this dataset?
- Should I unzip the files before loading them?
- Should I load the files one by one, or load all of them in a single tdb2.tdbloader run?
- Why does the performance drop dramatically after a few million records?

```
05:06:48 INFO loader :: Add: 3,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 120,685 / Avg: 186,474)
05:06:54 INFO loader :: Add: 4,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 166,917 / Avg: 181,167)
05:07:22 INFO loader :: Add: 5,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 35,449 / Avg: 99,427)
05:09:07 INFO loader :: Add: 6,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 9,560 / Avg: 38,737)
05:11:19 INFO loader :: Add: 7,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 7,545 / Avg: 24,355)
05:13:46 INFO loader :: Add: 8,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 6,839 / Avg: 18,449)
05:16:55 INFO loader :: Add: 9,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 5,270 / Avg: 14,437)
05:19:37 INFO loader :: Add: 10,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 6,190 / Avg: 12,740)
05:19:37 INFO loader :: Elapsed: 784.91 seconds [2026/01/15 05:19:37 UTC]
05:23:21 INFO loader :: Add: 11,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 4,456 / Avg: 10,898)
05:28:20 INFO loader :: Add: 12,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 3,350 / Avg: 9,175)
05:36:15 INFO loader :: Add: 13,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,104 / Avg: 7,290)
05:44:10 INFO loader :: Add: 14,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,105 / Avg: 6,200)
```

GitHub link: https://github.com/apache/jena/discussions/3701
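The `Batch` and `Avg` figures in the loader output appear to be rates in triples per second: the 3,000,000 and 4,000,000 lines are 6 seconds apart, which matches a batch rate of roughly 166,917. Assuming that interpretation, the following Python sketch shows one way to quantify the slowdown from a saved log. The function name `parse_progress` and the `SAMPLE` excerpt are illustrative choices, not part of Jena, and the regular expression is written only against the line shape shown in this post.

```python
import re

# Matches progress lines shaped like the ones in the post, e.g.
# "05:06:48 INFO loader :: Add: 3,000,000 file.ttl.gz (Batch: 120,685 / Avg: 186,474)"
# Lines of other shapes (such as the "Elapsed:" line) are ignored.
PROGRESS = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+INFO\s+loader\s+::\s+Add:\s+(?P<total>[\d,]+)\s+"
    r"(?P<file>\S+)\s+\(Batch:\s+(?P<batch>[\d,]+)\s+/\s+Avg:\s+(?P<avg>[\d,]+)\)"
)


def to_int(text):
    """Convert a comma-grouped number such as '3,000,000' to an int."""
    return int(text.replace(",", ""))


def parse_progress(log_text):
    """Return one dict per progress line found in log_text (hypothetical helper)."""
    rows = []
    for match in PROGRESS.finditer(log_text):
        rows.append({
            "time": match.group("time"),
            "file": match.group("file"),
            "total": to_int(match.group("total")),
            "batch": to_int(match.group("batch")),  # assumed: triples/s for the last batch
            "avg": to_int(match.group("avg")),      # assumed: triples/s since the start
        })
    return rows


# Three lines copied from the log in the post: the first two and the last.
SAMPLE = """\
05:06:48 INFO loader :: Add: 3,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 120,685 / Avg: 186,474)
05:06:54 INFO loader :: Add: 4,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 166,917 / Avg: 181,167)
05:44:10 INFO loader :: Add: 14,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,105 / Avg: 6,200)
"""

rows = parse_progress(SAMPLE)
for row in rows:
    print(f'{row["time"]}  total={row["total"]:>12,}  batch={row["batch"]:>9,}/s')

# Ratio between the first and last batch rates in the excerpt.
slowdown = rows[0]["batch"] / rows[-1]["batch"]
print(f"Batch rate dropped by a factor of about {slowdown:.0f}")
```

Running the same function over the full log would give a per-million-triples rate series that is easy to plot or attach to a follow-up post.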
