Hi,

TDB has a couple of command line utilities to initially load (i.e. build TDB indexes for) some RDF data into an empty TDB store. These commands are known as tdbloader [1] and tdbloader2 [2].
tdbloader is a Java program which uses the classes in the c.h.h.j.tdb.store.bulkloader package. It builds the node table and the primary indexes first, then builds the secondary indexes sequentially (although different policies exist: see the BuilderSecondaryIndexes implementations). tdbloader can also be used incrementally (with a little care in relation to your stats.opt file, if you use one).

tdbloader2 [3,4] is a bash script with some Java code (where the magic happens). tdbloader2 builds just the node table and text files for the input data using node ids. It then uses UNIX sort to sort those text files and produce text files ready to be streamed into BPlusTreeRewriter to generate the B+Tree indexes [5]. tdbloader2 cannot be used to incrementally load new data into an existing TDB database.

The more RAM you have the better, and RAM is becoming cheaper and cheaper (unless you use VMs in the cloud :-)). tdbloader serves us well for datasets from a few million triples|quads up to about 100M triples|quads, depending on how much RAM you have. tdbloader2 serves us well for datasets of a few hundred million triples|quads but, depending on how much RAM you have, its performance can drop as your dataset approaches 1B triples|quads or more.

What could we do to scale TDB loading further?

We (@ Talis) use MapReduce in our ingestion pipeline to validate|clean the data or to build the indexes we need. However, we are still using tdbloader|tdbloader2, and that is becoming the bottleneck of our ingestion pipeline (and an increasing cost) for datasets on the order of billions of triples|quads.

Is it possible to use MapReduce to generate 'standard' TDB indexes? The answer to this question is still open. We have shared a prototype (a.k.a. tdbloader3 [6,7]) which seems to provide a positive answer to that question. However, there is still a big problem with the current implementation: the first MapReduce job must use a single reducer, which is bad practice and quickly becomes the new bottleneck. In other words, it is a MapReduce antipattern that works against scalability. So, in practice, we still don't have a way to build our TDB indexes using MapReduce.

However, it might be possible to split the first MapReduce job into two MapReduce jobs (with multiple reducers each), which would remove the single reducer. The fact that the last MapReduce job uses 9 reducers (one for each TDB index) would probably become the next scalability bottleneck, but we hope this will not, in practice, be a problem for us, at least for a while.

Another alternative would be to (optionally) have a different node table implementation where node ids are not offsets into the file where the RDF node values are stored. In this scenario, we would use hash codes as node ids, so the node to node id mapping is direct and does not involve any index. But we would need indexes to map a node id to a hash and the hash to a file offset from which to retrieve the node value.

The first alternative seems more promising to me (there is a sketch of the idea after the TODO list below).

TODO:
- Add MiniMRCluster so that it is easy for developers to run tests with multiple reducers on a laptop.
- Split the first MapReduce job into two: one to produce offset values for each partition, the other to generate data files with correct ids for subsequent jobs.
- Build the node table by concatenating the output files from the MapReduce jobs above.
- Test on a cluster with a large (> 1B) dataset.
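To make the proposed split of the first MapReduce job a little more concrete, here is a minimal sketch in plain Java (this is not Hadoop or tdbloader3 code; the class names PartitionSummary and OffsetAssignmentSketch and the numbers are made up for illustration). The idea: the first job only records, per partition, how many bytes of node data that partition will write; the driver turns those counts into starting offsets; the second job can then assign final node ids (offsets into the node file) independently in each partition, with no single global reducer.

// Hypothetical sketch, not part of TDB or tdbloader3.
import java.util.ArrayList;
import java.util.List;

public class OffsetAssignmentSketch {

    // Output of the hypothetical first job: one summary per partition.
    static class PartitionSummary {
        final int partition;
        final long bytesOfNodeData; // total serialized size of the nodes in this partition
        PartitionSummary(int partition, long bytesOfNodeData) {
            this.partition = partition;
            this.bytesOfNodeData = bytesOfNodeData;
        }
    }

    // Driver-side step: a cumulative sum over the per-partition byte counts gives
    // the starting offset each partition adds to its local offsets.
    static long[] startingOffsets(List<PartitionSummary> summaries) {
        long[] starts = new long[summaries.size()];
        long running = 0L;
        for (PartitionSummary s : summaries) {
            starts[s.partition] = running;
            running += s.bytesOfNodeData;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<PartitionSummary> summaries = new ArrayList<>();
        summaries.add(new PartitionSummary(0, 1000L));
        summaries.add(new PartitionSummary(1, 2500L));
        summaries.add(new PartitionSummary(2, 700L));

        long[] starts = startingOffsets(summaries);
        for (int p = 0; p < starts.length; p++) {
            System.out.println("partition " + p + " base offset = " + starts[p]);
        }
        // In the second job, each partition would emit (node, starts[p] + localOffset),
        // so node ids are globally consistent without a single reducer.
    }
}

This is essentially the second TODO item above: a prefix sum over per-partition sizes replaces the single reducer that currently assigns global node ids.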
Help welcome :-)

Paolo

[1] http://openjena.org/wiki/TDB/Commands#tdbloader
[2] http://openjena.org/wiki/TDB/Commands#tdbloader2
[3] http://seaborne.blogspot.com/2010/12/tdb-bulk-loader-version-2.html
[4] http://seaborne.blogspot.com/2010/12/performance-benchmarks-for-tdb-loader.html
[5] http://seaborne.blogspot.com/2010/12/repacking-btrees.html
[6] https://github.com/castagna/tdbloader3
[7] http://markmail.org/thread/573frdoq2rqgm2gg
