Paolo Castagna wrote:
TODO:
- Add MiniMRCluster so that it is easy for developers to run tests with
  multiple reducers on a laptop.
  Done (see the first sketch after this list).
- Split the first MapReduce job into two: one to produce offset values for
  each partition, the other to generate data files with correct ids for
  subsequent jobs.
  Done (see the second sketch below).
- Build the node table by concatenating the output files from the MapReduce
  jobs above.
  Done (see the third sketch below).
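For the first item, this is roughly what the test harness looks like (a
minimal sketch, assuming the Hadoop 0.20.x test jars are on the classpath;
the class and method names are illustrative, not the actual tdbloader3 test
code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public abstract class AbstractMiniMRClusterTest {
    protected MiniDFSCluster dfsCluster ;
    protected MiniMRCluster mrCluster ;

    public void startCluster() throws Exception {
        Configuration conf = new Configuration() ;
        // 2 datanodes on a freshly formatted mini HDFS
        dfsCluster = new MiniDFSCluster(conf, 2, true, null) ;
        // 2 tasktrackers pointed at the mini HDFS: enough to run
        // jobs with more than one reducer on a laptop
        mrCluster = new MiniMRCluster(2, dfsCluster.getFileSystem().getUri().toString(), 1) ;
    }

    public JobConf createJobConf() {
        // jobs submitted with this JobConf run against the in-process cluster
        return mrCluster.createJobConf() ;
    }

    public void stopCluster() throws Exception {
        if ( mrCluster != null ) mrCluster.shutdown() ;
        if ( dfsCluster != null ) dfsCluster.shutdown() ;
    }
}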
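For the second item, the idea is: the first job only counts the records in
each partition, the driver turns those counts into cumulative offsets, and
the second job's reducer assigns id = offset + local counter. A hypothetical
sketch of that reducer (AssignIdsReducer, loadOffset and the offsets side
file are made-up names, not the code in the branch):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AssignIdsReducer extends MapReduceBase
        implements Reducer<Text, Text, LongWritable, Text> {

    private long nextId ;

    @Override
    public void configure(JobConf job) {
        // "mapred.task.partition" identifies this reducer's partition
        int partition = job.getInt("mapred.task.partition", 0) ;
        // hypothetical helper: reads the cumulative count of all
        // partitions before this one, as written by the first job
        nextId = loadOffset(job, partition) ;
    }

    public void reduce(Text node, Iterator<Text> values,
            OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        // one id per distinct RDF node; ids are consecutive within the
        // partition and, thanks to the offsets, unique across partitions
        output.collect(new LongWritable(nextId++), node) ;
    }

    private long loadOffset(JobConf job, int partition) {
        // sketch only: read the offset for 'partition' from the side
        // file produced by the first (counting) job
        throw new UnsupportedOperationException("sketch") ;
    }
}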
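For the third item, the concatenation itself can be done with Hadoop's
FileUtil.copyMerge (a sketch; the paths are illustrative, and it assumes
the part-* files sort into the right order by name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration() ;
        FileSystem fs = FileSystem.get(conf) ;
        // merge the part-* files under output/nodes into a single nodes.dat
        FileUtil.copyMerge(fs, new Path("output/nodes"),
                           fs, new Path("nodes.dat"),
                           false /* keep the sources */, conf, null) ;
    }
}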
All the changes are in a branch, here:
https://github.com/castagna/tdbloader3/tree/hadoop-0.20.203.0
There is only one final step which is currently not done using MapReduce:
the node2id.dat|idn files (i.e. the B+Tree index mapping 128-bit RDF node
hashes to 64-bit RDF node ids) are built from the nodes.dat file at the end
of all the MapReduce jobs:
Iterator<Pair<Long, ByteBuffer>> iter = objects.all() ;
while ( iter.hasNext() ) {
    Pair<Long, ByteBuffer> pair = iter.next() ;
    long id = pair.getLeft() ;
    // decode the RDF node stored at this offset in nodes.dat
    Node node = NodeLib.fetchDecode(id, objects) ;
    // compute the 128-bit hash used as the B+Tree key
    Hash hash = new Hash(recordFactory.keyLength()) ;
    setHash(hash, node) ;
    byte k[] = hash.getBytes() ;
    // record = (hash, id): the hash is the key, the node id the value
    Record record = recordFactory.create(k) ;
    Bytes.setLong(id, record.getValue(), 0) ;
    nodeToId.add(record) ;
}
I need to run a few experiments, but this saves a find() to check whether a
record is already in the index: we know the objects file contains only
unique RDF node values.
Indeed, while I was doing this I looked back at tdbloader2 and I think we
could use the BPlusTreeRewriter 'trick' for the node table as well. I cannot
reuse BPlusTreeRewriter as it is, since it was written for the SPO, GSPO,
etc. indexes, where records have 3 or 4 slots of constant size (64 bits
each).
In the case of the node table, records have only two slots: 128 bits for the
hash and 64 bits for the node id.
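To make the difference concrete, this is how the two record layouts compare,
expressed with TDB's RecordFactory (a sketch; the class is just a container
for the two shapes):

import com.hp.hpl.jena.tdb.base.record.RecordFactory;

public class RecordLayouts {
    // what BPlusTreeRewriter was written for: key-only records with
    // 3 (SPO) or 4 (GSPO) fixed 64-bit slots and no value part
    static final RecordFactory SPO  = new RecordFactory(3 * 8, 0) ;
    static final RecordFactory GSPO = new RecordFactory(4 * 8, 0) ;

    // what the node table needs: a key *and* a value part, with a
    // 128-bit hash as the key and a 64-bit node id as the value
    static final RecordFactory NODE2ID = new RecordFactory(16, 8) ;
}

A node-table rewriter would take NODE2ID records already sorted by hash and
pack them into leaves bottom-up, the same way BPlusTreeRewriter does for the
triple and quad indexes.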
I am keen to try to improve the first phase of tdbloader2, since I expect it
could further improve performance and scalability (in particular when the
node table indexes no longer fit in RAM).
@Andy, does this idea make sense?
- Test on a cluster with a large (> 1B triples) dataset.
  Soon...
Paolo