GitHub user maxx-ukoo created a discussion: How to load big dataset to new 
database

I am going to load the PubChem dataset 
(https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/, 
https://pubchem.ncbi.nlm.nih.gov/docs/rdf-load).
I start Fuseki, stop it, and then try to upload the data into the dataset 
directory using the tdb2.tdbloader utility.
I have a few questions:

- What is the correct and fastest way to load this dataset?
- Should I unzip the files before loading them?
- Should I load the files one by one, or load all of them in a single 
tdb2.tdbloader run?
- Why does the performance drop dramatically after a few million records?

```
05:06:48 INFO  loader          :: Add: 3,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 120,685 / Avg: 186,474)
05:06:54 INFO  loader          :: Add: 4,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 166,917 / Avg: 181,167)
05:07:22 INFO  loader          :: Add: 5,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 35,449 / Avg: 99,427)
05:09:07 INFO  loader          :: Add: 6,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 9,560 / Avg: 38,737)
05:11:19 INFO  loader          :: Add: 7,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 7,545 / Avg: 24,355)
05:13:46 INFO  loader          :: Add: 8,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 6,839 / Avg: 18,449)
05:16:55 INFO  loader          :: Add: 9,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 5,270 / Avg: 14,437)
05:19:37 INFO  loader          :: Add: 10,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 6,190 / Avg: 12,740)
05:19:37 INFO  loader          ::   Elapsed: 784.91 seconds [2026/01/15 05:19:37 UTC]
05:23:21 INFO  loader          :: Add: 11,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 4,456 / Avg: 10,898)
05:28:20 INFO  loader          :: Add: 12,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 3,350 / Avg: 9,175)
05:36:15 INFO  loader          :: Add: 13,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,104 / Avg: 7,290)
05:44:10 INFO  loader          :: Add: 14,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,105 / Avg: 6,200)
```
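
To put numbers on the slowdown, the per-interval rate can be recomputed from the timestamps in the log. The following is only a rough sketch and not part of Jena: the function name `batch_rates` and the regular expression are my own assumptions about the log layout shown above, with each entry on a single line. Because the timestamps have one-second resolution, the results only approximate the loader's own `Batch` figures, and the sketch does not handle a log that crosses midnight.

```python
import re
from datetime import datetime

# Assumed layout of a progress line, based on the log excerpt above:
# "HH:MM:SS INFO  loader  :: Add: <count> <file> (Batch: ... / Avg: ...)"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) INFO\s+loader\s+:: Add: ([\d,]+) ")


def batch_rates(log_lines):
    """Return (cumulative_count, approx_triples_per_second) for each interval."""
    points = []
    for line in log_lines:
        m = LINE_RE.match(line)
        if m:  # lines such as "Elapsed: ..." do not match and are skipped
            ts = datetime.strptime(m.group(1), "%H:%M:%S")
            count = int(m.group(2).replace(",", ""))
            points.append((ts, count))

    rates = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds > 0:  # skip zero-length intervals (same-second entries)
            rates.append((c1, round((c1 - c0) / seconds)))
    return rates


sample = [
    "05:06:48 INFO  loader          :: Add: 3,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 120,685 / Avg: 186,474)",
    "05:06:54 INFO  loader          :: Add: 4,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 166,917 / Avg: 181,167)",
    "05:36:15 INFO  loader          :: Add: 13,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,104 / Avg: 7,290)",
    "05:44:10 INFO  loader          :: Add: 14,000,000 pc_compound2defined_atom_stereo_count_000005.ttl.gz (Batch: 2,105 / Avg: 6,200)",
]

for count, rate in batch_rates(sample):
    print(f"up to {count:>12,}: ~{rate:,} triples/s")
```

For the last interval in the log (05:36:15 to 05:44:10), one million triples took 475 seconds, which is roughly 2,105 triples per second and matches the reported `Batch` value.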

GitHub link: https://github.com/apache/jena/discussions/3701

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

