Hi Andrea

It's the PDB dataset; the files are about 300 KB on average, although
some are as big as 5 MB. I have played around with rapper and was
considering using N-Triples files, since they are more amenable to
simple command-line manipulation, so it's something I'll definitely
bear in mind.

Thanks

Glenn.


On Thu, Mar 1, 2012 at 10:29 AM, Andrea Splendiani
<[email protected]> wrote:
> Hi,
>
> just a question: how big, on average, are your files?
> I have been dealing with large RDF datasets (like UniProt) that come
> as "many" small files.
> One way to get around the file number/size issue is to convert
> everything to N-Triples (e.g. with rapper, which is very fast). Then
> you can cut and merge files on the command line quite efficiently.
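> Roughly something like this (an untested sketch; I'm assuming the
> files sit under a data/ directory and that rapper is on the path):
>
>   # convert each RDF/XML file to N-Triples, one .nt per .rdf
>   find data/ -name '*.rdf' | while read -r f; do
>       rapper -i rdfxml -o ntriples "$f" > "${f%.rdf}.nt"
>   done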
>
> best,
> Andrea
>
> On 1 Mar 2012, at 09:53, Glenn Proctor wrote:
>
>> Hi
>>
>> I have a dataset which I would like to load into a TDB instance, and
>> access with Fuseki. Certain aspects of the dataset and TDB make this
>> challenging.
>>
>> The dataset (provided by someone else) is split into about 80,000
>> individual RDF/XML files, all in one directory by default. This makes
>> even straightforward directory listings slow, and means the shell's
>> maximum command-line length/argument count is exceeded, so I can't
>> just refer to *.rdf.
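>>
>> (For shell-level work, find sidesteps the limit, since the file
>> names travel through a pipe rather than the argument list; a quick
>> sketch, assuming the files are in a data/ directory:
>>
>>   find data/ -name '*.rdf' | wc -l
>>
>> counts them without ever expanding *.rdf.)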
>>
>> My first approach has been to create separate directories, each with
>> about a third of the files, and use tdbloader to load each group in
>> turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the
>> machine) and it took four hours to load and index the first group of
>> files, 207 million triples in total. As Andy mentioned in a thread
>> yesterday, the triples/sec count gradually declined over the course
>> of the import (from about 30k/sec to 24k/sec).
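>>
>> Each run looked roughly like this (a sketch rather than the exact
>> command; I'm assuming the standard tdbloader wrapper script, which
>> picks up its heap settings from the JVM_ARGS environment variable,
>> with group1/ as the first of the three directories):
>>
>>   JVM_ARGS=-Xmx6G tdbloader --loc=/data/tdb group1/*.rdf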
>>
>> However, when I tried to use tdbloader to load the next group of
>> files into the same TDB, performance declined dramatically - down to
>> about 400 triples/sec right from the start. Is this expected
>> behaviour? I wonder if it's because new data is being added to an
>> already-indexed set - is this the case, and if so, is there any way
>> to improve the performance? Coming from a relational database
>> background, my instinct would be to postpone indexing until all the
>> triples were loaded (i.e. after the third group of files was
>> imported), but I couldn't see any options affecting index creation
>> in tdbloader.
>>
>> Another question is whether the strategy I've adopted (i.e. loading 3
>> groups of ~27k files consecutively) is the correct one. The
>> alternative would be to merge all 80k files into one in a separate
>> step, then load the resulting humongous file. I suspect that there
>> would be different issues with that approach.
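>>
>> (That said, the merge itself looks straightforward once everything
>> is converted to N-Triples, since the format is line-oriented and
>> files can simply be concatenated; an untested sketch, assuming the
>> converted .nt files sit under data/:
>>
>>   find data/ -name '*.nt' -print0 | xargs -0 cat > merged.nt
>>
>> RDF/XML files, by contrast, can't just be cat'ed together.)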
>>
>> Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
>> instance be better? Or three separate TDB instances? Obviously the
>> latter would require some sort of query federation layer.
>>
>> I'm relatively new to this whole area so any tips on best practice
>> would be appreciated.
>>
>> Regards
>>
>> Glenn.
>
>
