Hi, just a question: how big are your files on average? I have been dealing with large RDF datasets (like UniProt) that come as "many" small files. One way to work around the file number/size problem is to convert everything to N-Triples (e.g. with rapper, which is very fast). Then you can cut and merge the files on the command line quite efficiently.
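For example, something along these lines (a rough, untested sketch; the directory and file names are just placeholders, and it assumes the source files are RDF/XML with simple file names):

    mkdir -p nt
    # convert each RDF/XML file to N-Triples with rapper
    find rdf -name '*.rdf' | while read -r f; do
        rapper -i rdfxml -o ntriples "$f" > "nt/$(basename "$f" .rdf).nt"
    done
    # N-Triples is line-oriented, so the per-file outputs can simply be
    # concatenated; find + xargs avoids the argument-length limit you would
    # hit with a plain *.nt glob
    find nt -name '*.nt' -print0 | xargs -0 cat > merged.nt

Once everything is in one (or a few) N-Triples files, you can also re-split it into evenly sized chunks with the standard split command before loading.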
best,
Andrea

On 01/Mar/2012, at 09:53, Glenn Proctor wrote:

> Hi
>
> I have a dataset which I would like to load into a TDB instance and
> access with Fuseki. Certain aspects of the dataset and of TDB make this
> challenging.
>
> The dataset (provided by someone else) is split into about 80,000
> individual RDF/XML files, all in one directory by default. This makes
> even straightforward directory listings slow, and it means that the
> command-line length/maximum number of arguments is exceeded, so I
> can't just refer to *.rdf.
>
> My first approach has been to create separate directories, each with
> about a third of the files, and to use tdbloader to load each group in
> turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the
> machine) and it took four hours to load and index the first group of
> files, a total of 207 million triples. As Andy mentioned in a thread
> yesterday, the triples/sec count gradually declined over the course of
> the import (from about 30k/sec to 24k/sec).
>
> However, when I tried to use tdbloader to load the next group of files
> into the same TDB, I found that performance declined dramatically,
> down to about 400 triples/sec right from the start. Is this expected
> behaviour? I wonder if it's because it's trying to add new data to an
> already indexed set. Is this the case, and if so, is there any way to
> improve the performance? Coming from a relational database background,
> my instinct would be to postpone indexing until all the triples were
> loaded (i.e. after the third group of files was imported), but I
> couldn't see any options affecting index creation in tdbloader.
>
> Another question is whether the strategy I've adopted (i.e. loading 3
> groups of ~27k files consecutively) is the correct one. The
> alternative would be to merge all 80k files into one in a separate
> step, then load the resulting humongous file. I suspect that there
> would be different issues with that approach.
>
> Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
> instance be better? Or three separate TDB instances? Obviously the
> latter would require some sort of query federation layer.
>
> I'm relatively new to this whole area, so any tips on best practice
> would be appreciated.
>
> Regards
>
> Glenn.
