Hi Andrea,

It's the PDB dataset; the files are about 300 KB on average, although some are as big as 5 MB. I have played around with rapper and was considering using N-Triples files, since they are more amenable to simple command-line manipulation, so it's something I'll definitely bear in mind.
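In case it's useful to the archives, the kind of pipeline I was thinking of (an untested sketch: it assumes rapper from the Raptor toolkit and GNU find, and pdb/ is a stand-in for wherever the data actually lives):

    # Convert each RDF/XML file to N-Triples; rapper writes triples to stdout
    find pdb/ -name '*.rdf' | while read -r f; do
        rapper -i rdfxml -o ntriples "$f" > "${f%.rdf}.nt"
    done

    # N-Triples is line-oriented, so the results can simply be concatenated;
    # "-exec ... +" batches the arguments and avoids the ARG_MAX limit
    find pdb/ -name '*.nt' -exec cat {} + > all.nt

    # One merged file means a single tdbloader run rather than incremental loads
    tdbloader --loc=/data/tdb all.nt

Since N-Triples concatenates safely, the merge step is trivial, and a single merged file would mean one tdbloader run rather than the incremental loads that seem to be hurting me.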
Thanks

Glenn.

On Thu, Mar 1, 2012 at 10:29 AM, Andrea Splendiani <[email protected]> wrote:
> Hi,
>
> just a question: how big are your files, on average?
> I have been dealing with large RDF datasets (like UniProt) that come in
> "many" small files.
> One way to get around the file number/size problem is to convert everything
> to N-Triples (e.g. with rapper, which is very fast). Then you can cut and
> merge files on the command line quite efficiently.
>
> best,
> Andrea
>
> On 01 Mar 2012, at 09:53, Glenn Proctor wrote:
>
>> Hi
>>
>> I have a dataset which I would like to load into a TDB instance and
>> access with Fuseki. Certain aspects of the dataset and of TDB make this
>> challenging.
>>
>> The dataset (provided by someone else) is split into about 80,000
>> individual RDF/XML files, all in one directory by default. This makes
>> even straightforward directory listings slow, and means that the
>> maximum command-line length/number of arguments is exceeded, so I
>> can't just refer to *.rdf.
>>
>> My first approach has been to create separate directories, each with
>> about a third of the files, and use tdbloader to load each group in
>> turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the
>> machine) and it took four hours to load and index the first group of
>> files, 207 million triples in total. As Andy mentioned in a thread
>> yesterday, the triples/sec count gradually declined over the course of
>> the import (from about 30k/sec to 24k/sec).
>>
>> However, when I tried to use tdbloader to load the next group of files
>> into the same TDB, performance declined dramatically, down to about
>> 400 triples/sec right from the start. Is this expected behaviour? I
>> wonder if it's because it's trying to add new data to an
>> already-indexed set; if so, is there any way to improve the
>> performance? Coming from a relational database background, my instinct
>> would be to postpone indexing until all the triples were loaded (i.e.
>> after the third group of files was imported), but I couldn't see any
>> options affecting index creation in tdbloader.
>>
>> Another question is whether the strategy I've adopted (loading three
>> groups of ~27k files consecutively) is the right one. The alternative
>> would be to merge all 80k files into one in a separate step, then load
>> the resulting humongous file. I suspect that approach would have
>> different issues of its own.
>>
>> Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
>> instance be better? Or three separate TDB instances? Obviously the
>> latter would require some sort of query federation layer.
>>
>> I'm relatively new to this whole area, so any tips on best practice
>> would be appreciated.
>>
>> Regards
>>
>> Glenn.
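For anyone who finds this thread later: the directory split Glenn describes can be scripted without ever expanding *.rdf on a command line. A rough sketch, assuming GNU findutils and coreutils, with pdb/ again standing in for the data directory:

    # List the ~80k files once, then split the listing into three batches
    find pdb/ -maxdepth 1 -name '*.rdf' > files.txt
    split -l 27000 files.txt batch-     # produces batch-aa, batch-ab, batch-ac

    # Move each batch into its own directory; xargs keeps every mv
    # invocation under the kernel's argument-length limit
    mkdir -p group1 group2 group3
    xargs -a batch-aa mv -t group1/
    xargs -a batch-ab mv -t group2/
    xargs -a batch-ac mv -t group3/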
