Hi,

I have a dataset which I would like to load into a TDB instance and access with Fuseki. Certain aspects of both the dataset and TDB make this challenging.
The dataset (provided by someone else) is split into about 80000 individual RDF/XML files, all in one directory by default. This makes even straightforward directory listings slow, and it means that the maximum command-line length/number of arguments is exceeded, so I can't just refer to *.rdf.

My first approach has been to create separate directories, each with about a third of the files, and use tdbloader to load each group in turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the machine) and it took four hours to load and index the first group of files, 207 million triples in total. As Andy mentioned in a thread yesterday, the triples/sec count gradually declined over the course of the import (from about 30k/sec to 24k/sec).

However, when I tried to use tdbloader to load the next group of files into the same TDB, I found that performance declined dramatically: down to about 400 triples/sec right from the start. Is this expected behaviour? I wonder if it's because it's trying to add new data to an already indexed set. Is this the case, and if so, is there any way to improve the performance? Coming from a relational database background, my instinct would be to postpone indexing until all the triples were loaded (i.e. after the third group of files was imported), but I couldn't see any options affecting index creation in tdbloader.

Another question is whether the strategy I've adopted (i.e. loading 3 groups of ~27k files consecutively) is the correct one. The alternative would be to merge all 80k files into one in a separate step, then load the resulting humongous file. I suspect that there would be different issues with that approach.

Is TDB even appropriate for this? Would (say) a MySQL-backed SDB instance be better? Or three separate TDB instances? Obviously the latter would require some sort of query federation layer.

I'm relatively new to this whole area, so any tips on best practice would be appreciated.

Regards,
Glenn
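
For reference, the grouping step described above could be scripted instead of moving files into separate directories by hand. The following is a minimal Python sketch, not a tested loading procedure: the helper names (batch_paths, list_rdf_files, build_commands, load_all) and the size limits are hypothetical, and the "tdbloader --loc=DIR FILE..." invocation form is an assumption that should be checked against the installed Jena version. It only addresses the argument-length limit; it does nothing about the slowdown seen when loading into an already populated TDB.

```python
import os
import subprocess
from typing import Iterable, Iterator, List


def batch_paths(paths: Iterable[str],
                max_bytes: int = 100_000,
                max_count: int = 30_000) -> Iterator[List[str]]:
    """Yield groups of paths that stay under a byte and a count limit.

    max_bytes and max_count are illustrative defaults, not values taken
    from any operating system or from tdbloader.
    """
    batch: List[str] = []
    size = 0
    for path in paths:
        cost = len(os.fsencode(path)) + 1  # +1 for the argument terminator
        if batch and (size + cost > max_bytes or len(batch) >= max_count):
            yield batch
            batch, size = [], 0
        batch.append(path)
        size += cost
    if batch:
        yield batch


def list_rdf_files(directory: str) -> List[str]:
    """Return the .rdf files in a directory, sorted, without shell globbing."""
    with os.scandir(directory) as entries:
        return sorted(e.path for e in entries
                      if e.is_file() and e.name.endswith(".rdf"))


def build_commands(directory: str, tdb_location: str,
                   **limits: int) -> List[List[str]]:
    """Build one tdbloader argument list per batch (assumed CLI form)."""
    return [["tdbloader", f"--loc={tdb_location}", *batch]
            for batch in batch_paths(list_rdf_files(directory), **limits)]


def load_all(directory: str, tdb_location: str) -> None:
    """Run the batches one after another, stopping at the first failure."""
    for command in build_commands(directory, tdb_location):
        subprocess.run(command, check=True)
```

Calling load_all("/path/to/rdf", "/path/to/tdb") (both paths are placeholders) would run one tdbloader process per batch. Because the arguments are passed as a list rather than through a shell, no glob expansion takes place, although the operating system's own limit on total argument size still applies to each batch.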
