Strategies for loading large (>500m triples) datasets

Glenn Proctor Thu, 01 Mar 2012 01:54:09 -0800

Hi

I have a dataset which I would like to load into a TDB instance, and
access with Fuseki. Certain aspects of the dataset and TDB make this
challenging.


The dataset (provided by someone else) is split into about 80000
individual RDF/XML files, all in one directory by default. This makes
even straightforward directory listings slow, and means that the
command-line length/maximum number of arguments is exceeded, so I
can't just refer to *.rdf.
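
One way around the argument-length limit is to walk the directory lazily and hand the files over in fixed-size batches instead of relying on shell globbing. The following is a minimal Python sketch of that idea, not part of the original setup; the function names `iter_rdf_files` and `batches` and the `.rdf` suffix default are illustrative assumptions.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List


def iter_rdf_files(directory: str, suffix: str = ".rdf") -> Iterator[str]:
    """Yield matching file paths one at a time, without sorting or globbing."""
    # os.scandir streams directory entries, so nothing is expanded on a
    # command line and no full sorted listing is built up front.
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                yield entry.path


def batches(paths: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group paths into lists of at most `size` items."""
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch is a plain list of paths, so it can be kept short enough to pass as arguments to a single command invocation, or fed to a conversion or merge step.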

My first approach has been to create separate directories, each with
about a third of the files, and use tdbloader to load each group in
turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the
machine) and it took four hours to load and index the first group of
files, 207m triples in total. As Andy mentioned in a thread
yesterday, the triples/sec count gradually declined over the course of
the import (from about 30k/sec to 24k/sec).

However, when I tried to use tdbloader to load the next group of files
into the same TDB, I found that performance declined dramatically -
down to about 400 triples/sec right from the start. Is this expected
behaviour? I wonder if it's because it's trying to add new data to an
already indexed set - is this the case, and if so is there any way to
improve the performance? Coming from a relational database background,
my instinct would be to postpone indexing until all the triples were
loaded (i.e. after the third group of files was imported). However, I
couldn't see any options affecting index creation in tdbloader.
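
To put the reported rates in perspective, here is a rough back-of-envelope calculation in Python. The first three constants come from the figures above. Treating the second group as roughly the same size as the first is an assumption made only for illustration.

```python
# Figures reported in the post.
TRIPLES_FIRST_GROUP = 207_000_000   # triples in the first group
FIRST_LOAD_HOURS = 4                # time to load and index that group
SLOW_RATE = 400                     # triples/sec seen on the second load

# Overall throughput of the first load, index building included.
overall_rate = TRIPLES_FIRST_GROUP / (FIRST_LOAD_HOURS * 3600)

# Assumption: the second group is about as large as the first.
second_group_days = TRIPLES_FIRST_GROUP / SLOW_RATE / 86_400

print(f"first load, overall: {overall_rate:,.0f} triples/sec")
print(f"second load at {SLOW_RATE} triples/sec: {second_group_days:.1f} days")
```

The overall figure comes out at roughly 14k triples/sec, below the 24k to 30k/sec seen while triples were being read, which would be consistent with the four hours also covering index construction. At 400 triples/sec, a second group of similar size would take about six days.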

Another question is whether the strategy I've adopted (i.e. loading 3
groups of ~27k files consecutively) is the correct one. The
alternative would be to merge all 80k files into one in a separate
step, then load the resulting humongous file. I suspect that there
would be different issues with that approach.
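
As a sketch of what the merge step could look like: RDF/XML files cannot simply be concatenated, but N-Triples is line-based, so files in that format can be. The Python below assumes each RDF/XML file has already been converted to N-Triples in a separate step (for example with a tool such as Jena's riot; that step is not shown) and that blank node labels do not collide across files. The function name `concatenate_ntriples` is hypothetical.

```python
from typing import Iterable


def concatenate_ntriples(paths: Iterable[str], out_path: str) -> int:
    """Append each N-Triples file to out_path; return the number of files merged."""
    count = 0
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                last = b"\n"
                for line in src:
                    out.write(line)
                    last = line
                # Guard against a file whose final line lacks a newline,
                # which would otherwise fuse two triples onto one line.
                if not last.endswith(b"\n"):
                    out.write(b"\n")
            count += 1
    return count
```

The files are streamed line by line, so memory use stays flat regardless of how large the merged output becomes.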

Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
instance be better? Or three separate TDB instances? Obviously the
latter would require some sort of query federation layer.

I'm relatively new to this whole area so any tips on best practice
would be appreciated.

Regards

Glenn.
