TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g. loading hundreds of millions of triples into a live database.

Releases will soon be available from Maven Central:

   <dependency>
     <groupId>org.seaborne.mantis</groupId>
     <artifactId>tdb2</artifactId>
     <version>0.2.0</version>
   </dependency>

** TDB2 databases are not compatible with Apache Jena TDB (TDB1). **
Bad things will happen if one is used to open a database created by the other.

Currently, it is undergoing final testing.



So what are the differences between TDB1 and TDB2?

== Technical Changes

= Indexes

Indexes are now copy-on-write B+Trees which are immutable once a transaction commits. Slightly confusingly, these are called "persistent data structures" in the literature. This does not refer to being on disk, but to the fact that committed versions stay in place permanently and are not lost on a later update.

Jena TIM uses Dexx Collections for the same purpose.
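To make the copy-on-write idea concrete, here is a minimal sketch. It uses a plain binary search tree rather than a B+Tree to keep it short, and the class name PNode is made up for this example; it is not TDB2 code. An insert copies only the nodes on the path from the root to the change and shares everything else with the old version.

```java
// Sketch only: a persistent (copy-on-write) binary search tree.
// TDB2 uses B+Trees; a BST is assumed here purely to keep the example small.
final class PNode {
    final int key;
    final PNode left;
    final PNode right;

    PNode(int key, PNode left, PNode right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns the root of a new version containing k.
    // The version rooted at n is never modified.
    static PNode insert(PNode n, int k) {
        if (n == null) return new PNode(k, null, null);
        if (k < n.key) return new PNode(n.key, insert(n.left, k), n.right); // right subtree shared
        if (k > n.key) return new PNode(n.key, n.left, insert(n.right, k)); // left subtree shared
        return n; // key already present: the version is unchanged
    }

    static boolean contains(PNode n, int k) {
        while (n != null) {
            if (k == n.key) return true;
            n = (k < n.key) ? n.left : n.right;
        }
        return false;
    }
}
```

A reader holding the old root keeps seeing the old data, while a writer publishes a new root on commit.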

Updates happen to the B+Trees as a write transaction progresses. The journal now only holds a small amount of data recording the new state of the tree: its root pointer and two file limits, one for the branches file and one for the leaves file. That is 24 bytes.
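As an illustration of how small that journal entry is, the sketch below packs a root pointer and two file limits into 24 bytes. The field names, the choice of three 8-byte longs, and the field order are all assumptions made for this example; the actual on-disk layout may differ.

```java
import java.nio.ByteBuffer;

// Sketch only: a hypothetical 24-byte record of tree state.
// Assumes three 8-byte longs; the real journal format is not shown in this post.
final class TreeState {
    static final int SIZE = 3 * Long.BYTES; // 3 * 8 = 24 bytes

    final long root;        // root pointer of the B+Tree
    final long branchLimit; // limit of the branches file
    final long leafLimit;   // limit of the leaves file

    TreeState(long root, long branchLimit, long leafLimit) {
        this.root = root;
        this.branchLimit = branchLimit;
        this.leafLimit = leafLimit;
    }

    // Serialize to a buffer positioned at 0 and ready to be read or written out.
    ByteBuffer encode() {
        ByteBuffer bb = ByteBuffer.allocate(SIZE);
        bb.putLong(root).putLong(branchLimit).putLong(leafLimit);
        bb.flip();
        return bb;
    }

    // Read the three longs back in the same order they were written.
    static TreeState decode(ByteBuffer bb) {
        long root = bb.getLong();
        long branchLimit = bb.getLong();
        long leafLimit = bb.getLong();
        return new TreeState(root, branchLimit, leafLimit);
    }
}
```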

This has several desirable effects:

* Write-once
* Writer-pays
* No in-memory copy

Data is written straight into the indexes and is flushed to disk by the OS while the transaction runs (i.e. it is asynchronous to the data updates). Changes in the data do not go into the journal at all; only index state goes into the journal. In TDB1, changes are written to the journal and buffered in memory, then later written to disk. The final sync() happens as the writer commits.

Active readers no longer hold up the write-back, so that source of growing journals has been eliminated as well.

= Nodes

The node data is now held in a binary form (using RDF/Thrift).

The NodeId format has been revised: datatypes are always retained, even for inline values (so xsd:int does not become xsd:integer, although "001" still becomes "1").
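The effect on inline values can be sketched as follows. The class InlineValue is hypothetical and only models integer literals; it is not the NodeId encoding itself. The datatype is stored as given, while the lexical form is rebuilt from the numeric value, so leading zeros are lost.

```java
// Sketch only: models "datatype retained, lexical form canonicalized".
// InlineValue is a made-up class, not the TDB2 NodeId format.
final class InlineValue {
    final String datatype; // kept exactly as supplied, e.g. "xsd:int"
    final long value;      // only the numeric value is stored inline

    InlineValue(String datatype, long value) {
        this.datatype = datatype;
        this.value = value;
    }

    // Parse a lexical form; the original spelling of the number is not kept.
    static InlineValue parse(String lexical, String datatype) {
        return new InlineValue(datatype, Long.parseLong(lexical));
    }

    // The lexical form is regenerated from the stored value.
    String lexicalForm() {
        return Long.toString(value);
    }
}
```

With this model, parsing "001" as xsd:int gives back the lexical form "1" with the datatype still xsd:int.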

= Transactions

There is a completely new transaction mechanism. It is now a general framework that can work with multiple components. A TDB2 database is a number of such components - one per index, and also the node table. It could be enhanced to provide multiple dataset transactions and work with external indexes. The API on datasets is unchanged.
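One way to picture a transaction spanning several components is the sketch below. The interface TxnComponent, its prepare/commit/abort methods, and the coordinator are all hypothetical names chosen for this example and are not the TDB2 interfaces; the two-phase ordering is likewise an assumption about how such a framework could work.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical component contract, not the TDB2 API.
interface TxnComponent {
    void prepare() throws Exception; // may fail; nothing is visible yet
    void commit();                   // make the prepared changes visible
    void abort();                    // discard the changes
}

// Commits every registered component, or aborts them all if any prepare fails.
final class TxnCoordinator {
    private final List<TxnComponent> components = new ArrayList<>();

    void register(TxnComponent c) {
        components.add(c);
    }

    // Returns true if all components committed, false if all were aborted.
    boolean commitAll() {
        try {
            for (TxnComponent c : components) c.prepare();
        } catch (Exception e) {
            for (TxnComponent c : components) c.abort();
            return false;
        }
        for (TxnComponent c : components) c.commit();
        return true;
    }
}

// A trivial component standing in for an index or the node table.
final class RecordingComponent implements TxnComponent {
    final boolean failOnPrepare;
    String state = "active";

    RecordingComponent(boolean failOnPrepare) {
        this.failOnPrepare = failOnPrepare;
    }

    @Override public void prepare() throws Exception {
        if (failOnPrepare) throw new Exception("prepare failed");
        state = "prepared";
    }

    @Override public void commit() { state = "committed"; }

    @Override public void abort() { state = "aborted"; }
}
```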

== Status

The one remaining work item is to provide storage reclamation. The index style means indexes grow in size. A means to GC the database, pruning it to a specific version, is needed. At the moment, this can be done with a backup/restore.

== Possibilities

Given this design, some features are possible, i.e. they could be done but have not been.

"See into the past" - a read-transaction can be started that sees some specific committed state from the past, not the latest commit. The database does not forget any committed changes unless storage is reclaimed.

This can also be used to reset the whole database to a point in the past and then allow it to evolve from there. (Actually, branching from the old version is also technically possible, but having a database that has branched without a way to merge will probably cause general chaos.)
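Both ideas follow from keeping the committed roots around. The sketch below is not an existing TDB2 feature; VersionLog and its methods are invented for illustration, and the type R stands for whatever identifies an immutable committed state (for example a root pointer).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a hypothetical log of committed roots, one per version.
// Each root must refer to immutable state for old versions to stay readable.
final class VersionLog<R> {
    private final List<R> roots = new ArrayList<>();

    // Record a newly committed root and return its version number.
    int commit(R root) {
        roots.add(root);
        return roots.size() - 1;
    }

    // What a normal read transaction would see.
    R latest() {
        return roots.get(roots.size() - 1);
    }

    // "See into the past": the state as of an earlier commit.
    R rootAt(int version) {
        return roots.get(version);
    }

    // Reset the database to an earlier version; later versions are dropped.
    void resetTo(int version) {
        roots.subList(version + 1, roots.size()).clear();
    }
}
```

Once storage is reclaimed, the dropped versions could no longer be read, which matches the caveat above.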
