Paolo Castagna wrote:
> Hi,
> in the last few days I made some experiments on different (hopefully more
> scalable, in particular on machines with RAM constraints) ways to generate
> TDB indexes.
> These improvements could be beneficial for tdbloader2 or a pure Java
> version of it (see: [1]). One specific thing, in particular, is necessary
> to complete tdbloader3 (i.e. a MapReduce implementation of a TDB loader).
>
> This email focuses on the node table only, and more precisely on the
> B+Tree index of the node table. This index has records with keys of
> 128 bits, which represent the hash of RDF node values, and values of
> 64 bits, which represent the corresponding node ids. Given an RDF node,
> this index is used to retrieve its node id. This is needed to replace RDF
> node values before executing a query (since queries use indexes that
> contain node ids only).
>
> I'd like to be able to use the same technique used by tdbloader2 in the
> final stage for the SPO, POS, OSP, GSPO, GPOS, etc. B+Tree indexes to
> build the B+Tree index of the node table (see: [2]).
>
> I know how to generate and sort a file containing hash|id, see [3] for
> example.
>
> However, I don't think the current BPlusTreeRewriter can be used as it is
> to rebuild a B+Tree index from such a file. I think the main reason is
> that it uses createKeyOnly().
>
> Is that the only obstacle, or is it much more complicated than that?
>
> Is it possible to change/adapt/extend BPlusTreeRewriter to support this
> use case as well?

Well, I was wrong: BPlusTreeRewriter works with Records with values as well.

Here:

  https://github.com/castagna/tdbloader3/blob/master/src/main/java/org/apache/jena/tdbloader3/NodeTableRewriter.java
  https://github.com/castagna/tdbloader3/blob/master/src/test/java/org/apache/jena/tdbloader3/TestNodeTableRewriter.java
  https://github.com/castagna/tdbloader3/blob/master/src/main/java/cmd/nodetablebuilder.java

This can help JENA-117 (i.e. a pure Java version of tdbloader2).
More tests are necessary to establish whether that would be faster than the
current one.

Paolo

>
> Thanks,
> Paolo
>
> [1] https://issues.apache.org/jira/browse/JENA-117
> [2] http://seaborne.blogspot.com/2010/12/repacking-btrees.html
> [3] https://github.com/castagna/tdbloader3/blob/master/src/main/java/org/apache/jena/tdbloader3/NodeTableBuilder.java#L97
>
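
For readers who want to picture the sorted hash|id input mentioned in this thread, here is a minimal sketch in plain Java. It does not use any Jena or TDB classes. The class name NodeRecordSketch and its helpers are hypothetical, the use of MD5 as the 128-bit hash is an assumption, and the sequential ids are made up for illustration (they are not how TDB actually assigns node ids). It only shows fixed-width records (16-byte key, 8-byte value) sorted by key, which is the kind of ordered input a B+Tree bulk rebuild would consume.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NodeRecordSketch {
    static final int KEY_LEN = 16;   // 128-bit hash of the RDF node value
    static final int VALUE_LEN = 8;  // 64-bit node id

    // Hypothetical helper: pack hash(node) | id into one fixed-width record.
    // Assumption: MD5 stands in for the 128-bit node hash.
    static byte[] record(String node, long id) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(node.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.allocate(KEY_LEN + VALUE_LEN)
                    .put(hash)
                    .putLong(id)
                    .array();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Order records by the key bytes only, compared as unsigned bytes.
    static final Comparator<byte[]> BY_KEY =
            (a, b) -> Arrays.compareUnsigned(a, 0, KEY_LEN, b, 0, KEY_LEN);

    // Read the 64-bit value stored after the key.
    static long idOf(byte[] record) {
        return ByteBuffer.wrap(record, KEY_LEN, VALUE_LEN).getLong();
    }

    // Build one record per node (ids are illustrative) and sort by key.
    static List<byte[]> sortedRecords(String[] nodes) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < nodes.length; i++) {
            records.add(record(nodes[i], i));
        }
        records.sort(BY_KEY);
        return records;
    }

    public static void main(String[] args) {
        String[] nodes = {
            "<http://example.org/s>", "<http://example.org/p>", "\"a literal\""
        };
        for (byte[] r : sortedRecords(nodes)) {
            System.out.println("record of " + r.length + " bytes, id=" + idOf(r));
        }
    }
}
```

The point of keeping the value inside the record is the one made in the reply: the rewriter has to carry key and value together, so a key-only record is not enough for the node table index.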
