Hi, just a question: how big are your files on average? I have been dealing with large RDF datasets (like UniProt) that come as "many" small files. One way to work around the file number/size problem is to convert everything to N-Triples (e.g. with rapper, which is very fast). Then you can cut and merge the files on the command line quite efficiently.
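For example, something along these lines (a rough, untested sketch; the directory and file names are just placeholders, and it assumes the source files are RDF/XML with simple file names):

    mkdir -p nt
    # convert each RDF/XML file to N-Triples with rapper
    find rdf -name '*.rdf' | while read -r f; do
        rapper -i rdfxml -o ntriples "$f" > "nt/$(basename "$f" .rdf).nt"
    done
    # N-Triples is line-oriented, so the per-file outputs can simply be
    # concatenated; find + xargs avoids the argument-length limit you would
    # hit with a plain *.nt glob
    find nt -name '*.nt' -print0 | xargs -0 cat > merged.nt

Once everything is in one (or a few) N-Triples files, you can also re-split it into evenly sized chunks with the standard split command before loading.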
best,
Andrea

On 01/Mar/2012, at 09:53, Glenn Proctor wrote:

> Hi
>
> I have a dataset which I would like to load into a TDB instance and
> access with Fuseki. Certain aspects of the dataset and of TDB make this
> challenging.
>
> The dataset (provided by someone else) is split into about 80,000
> individual RDF/XML files, all in one directory by default. This makes
> even straightforward directory listings slow, and it means that the
> command-line length/maximum number of arguments is exceeded, so I
> can't just refer to *.rdf.
>
> My first approach has been to create separate directories, each with
> about a third of the files, and to use tdbloader to load each group in
> turn. I gave tdbloader 6 GB of memory (of the 7 GB available on the
> machine) and it took four hours to load and index the first group of
> files, a total of 207 million triples. As Andy mentioned in a thread
> yesterday, the triples/sec count gradually declined over the course of
> the import (from about 30k/sec to 24k/sec).
>
> However, when I tried to use tdbloader to load the next group of files
> into the same TDB, I found that performance declined dramatically,
> down to about 400 triples/sec right from the start. Is this expected
> behaviour? I wonder if it's because it's trying to add new data to an
> already indexed set. Is this the case, and if so, is there any way to
> improve the performance? Coming from a relational database background,
> my instinct would be to postpone indexing until all the triples were
> loaded (i.e. after the third group of files was imported), but I
> couldn't see any options affecting index creation in tdbloader.
>
> Another question is whether the strategy I've adopted (i.e. loading 3
> groups of ~27k files consecutively) is the correct one. The
> alternative would be to merge all 80k files into one in a separate
> step, then load the resulting humongous file. I suspect that there
> would be different issues with that approach.
>
> Is TDB even appropriate for this? Would (say) a MySQL-backed SDB
> instance be better? Or three separate TDB instances? Obviously the
> latter would require some sort of query federation layer.
>
> I'm relatively new to this whole area, so any tips on best practice
> would be appreciated.
>
> Regards
>
> Glenn.
