TDB2 is an upgrade of TDB. It provides fully scalable transactions, e.g.
loading hundreds of millions of triples into a live database.
Releases will soon be available from Maven Central:
<dependency>
  <groupId>org.seaborne.mantis</groupId>
  <artifactId>tdb2</artifactId>
  <version>0.2.0</version>
</dependency>
** TDB2 databases are not compatible with Apache Jena TDB (TDB1). **
Bad things will happen if the two are mixed.
TDB2 is currently undergoing final testing.
So what are the differences between TDB1 and TDB2?
== Technical Changes
= Indexes
Indexes are now copy-on-write B+Trees which are immutable once a
transaction commits. Slightly confusingly, these are called "persistent
data structures" in the literature - this does not refer to being
on-disk but to the fact that committed versions remain in place and are
not lost on a later update.
Jena TIM uses Dexx Collections for the same purpose.
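The copy-on-write idea can be sketched with a much simpler structure than a B+Tree. The following is an illustration only, using a hypothetical path-copying binary search tree (`PNode` is not a TDB2 class): an insert copies the nodes on the path from the root to the change and shares everything else, so the old root still describes the old version.

```java
// Simplified illustration only: a path-copying binary search tree,
// not the B+Tree implementation used by TDB2. PNode is a hypothetical name.
final class PNode {
    final int key;
    final PNode left;
    final PNode right;

    PNode(int key, PNode left, PNode right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    // Returns the root of a new version. Nodes on the path to the
    // insertion point are copied; untouched subtrees are shared.
    static PNode insert(PNode n, int key) {
        if (n == null) return new PNode(key, null, null);
        if (key < n.key) return new PNode(n.key, insert(n.left, key), n.right);
        if (key > n.key) return new PNode(n.key, n.left, insert(n.right, key));
        return n; // key already present: the version is unchanged
    }

    static boolean contains(PNode n, int key) {
        while (n != null) {
            if (key == n.key) return true;
            n = key < n.key ? n.left : n.right;
        }
        return false;
    }
}
```

A reader holding the old root keeps seeing the old contents, while a writer works on the new root; neither needs to lock the other out.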
Updates happen to the B+Trees as a write transaction progresses. The
journal now only records a small amount of data describing the new state
of the tree: its root pointer and two file limits, one for the branches
file and one for the leaves file. That is 24 bytes.
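As a rough sketch of how small such a journal entry is, the three values could be laid out as three 8-byte longs. The field widths, field order and the `TreeState` class are assumptions made for illustration; the text above only states that the record is a root pointer plus two file limits, 24 bytes in all.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: three 8-byte longs. This is an assumption for
// illustration, not the actual TDB2 journal format.
final class TreeState {
    static final int SIZE = 3 * Long.BYTES; // 24 bytes

    final long root;        // pointer to the root block of the tree
    final long branchLimit; // limit of the branches file
    final long leafLimit;   // limit of the leaves file

    TreeState(long root, long branchLimit, long leafLimit) {
        this.root = root;
        this.branchLimit = branchLimit;
        this.leafLimit = leafLimit;
    }

    // Encodes the state into a buffer that is ready to be read or written out.
    ByteBuffer encode() {
        ByteBuffer bb = ByteBuffer.allocate(SIZE);
        bb.putLong(root).putLong(branchLimit).putLong(leafLimit);
        bb.flip();
        return bb;
    }

    // Reads three longs from the buffer's current position.
    static TreeState decode(ByteBuffer bb) {
        return new TreeState(bb.getLong(), bb.getLong(), bb.getLong());
    }
}
```

Whatever the real layout is, the point is that the journal entry has a fixed size that does not depend on how much data the transaction wrote.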
This has several desirable effects:
* Write-once
* Writer-pays
* No in-memory copy
Data is written straight into the indexes and is flushed to disk by the
OS while the transaction runs (i.e. flushing is asynchronous to the data
updates). Changes in the data do not go into the journal at all; only
index state goes into the journal. In TDB1, changes are written to the
journal, buffered in-memory, and later written to disk again. The
final sync() happens as the writer commits.
Active readers no longer hold up the write-back, so that source of
growing journals has been eliminated as well.
= Nodes
The node data is now held in a binary form (using RDF/Thrift).
The NodeId format has been revised: datatypes are always retained, even
for inline values (so xsd:int does not become xsd:integer, although "001"
still becomes "1").
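The effect on inline values can be shown with a small sketch. `InlineValue` is a hypothetical class and says nothing about the real NodeId bit layout; it only models the behaviour described above, where the numeric value is stored (so leading zeros are lost) but the original datatype is kept alongside it.

```java
import java.math.BigInteger;

// Hypothetical model of an inlined numeric literal; not the TDB2 NodeId encoding.
final class InlineValue {
    final String datatype;  // retained exactly as given, e.g. "xsd:int"
    final BigInteger value; // the value, not the original lexical form

    private InlineValue(String datatype, BigInteger value) {
        this.datatype = datatype;
        this.value = value;
    }

    // Parses the lexical form; "001" and "1" produce the same value.
    static InlineValue parse(String lexicalForm, String datatype) {
        return new InlineValue(datatype, new BigInteger(lexicalForm));
    }

    // The lexical form that comes back out is the canonical one.
    String lexicalForm() {
        return value.toString();
    }
}
```

With this model, `InlineValue.parse("001", "xsd:int")` reports the lexical form "1" and the datatype "xsd:int".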
= Transactions
There is a completely new transaction mechanism. It is now a general
framework that can work with multiple components. A TDB2 database is a
number of such components - one per index, and also the node table. It
could be enhanced to provide transactions spanning multiple datasets and
to work with external indexes. The API on datasets is unchanged.
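The component idea can be sketched as a coordinator that forwards each transaction step to every registered component. All names here (`TxnComponent`, `TxnCoordinator`, `CounterComponent`) are hypothetical and the sketch leaves out journaling, prepare phases and concurrency; it only shows the shape of "one transaction, many components".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical component contract; the real TDB2 interfaces may differ.
interface TxnComponent {
    void begin();
    void commit();
    void abort();
}

// A toy component standing in for an index or the node table.
final class CounterComponent implements TxnComponent {
    private long committed = 0;
    private long working = 0;

    public void begin()  { working = committed; }
    public void commit() { committed = working; }
    public void abort()  { working = committed; }

    void increment()      { working++; }
    long committedValue() { return committed; }
}

// Drives every registered component through the same transaction steps.
final class TxnCoordinator {
    private final List<TxnComponent> components = new ArrayList<>();

    void register(TxnComponent c) { components.add(c); }

    void begin()  { for (TxnComponent c : components) c.begin(); }
    void commit() { for (TxnComponent c : components) c.commit(); }
    void abort()  { for (TxnComponent c : components) c.abort(); }
}
```

Because the coordinator only knows the component contract, adding an external index would, under these assumptions, mean registering one more component.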
== Status
The one remaining work item is to provide storage reclamation. Because
the indexes are copy-on-write, they grow in size with every update. A
means to GC the database, pruning it to a specific version, is needed.
At the moment, this can be done with a backup/restore.
== Possibilities
Given this design, some features become possible, i.e. they could be
implemented but have not been.
"See into the past" - a read-transaction can be started that sees some
specific committed state from the past, not the latest commit. The
database does not forget any committed changes unless storage is reclaimed.
This can also be used to reset the whole database to a point in the past
and then allow it to evolve from there. (Actually, branching from the
old version is also technically possible, but having a database that has
branched without a way to merge would probably cause general chaos.)
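Both ideas, reading an old version and resetting to it, can be sketched with a store that keeps every committed state. `VersionedStore` is a hypothetical class; for brevity it copies the whole set on each commit rather than sharing structure as a copy-on-write tree would, so it illustrates the behaviour, not the storage cost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: every commit appends an immutable snapshot.
final class VersionedStore {
    private final List<Set<String>> versions = new ArrayList<>();

    VersionedStore() {
        versions.add(Collections.<String>emptySet()); // version 0 is empty
    }

    int latest() {
        return versions.size() - 1;
    }

    // Commits a new version containing the previous contents plus 'added'.
    int commit(String... added) {
        Set<String> next = new HashSet<>(versions.get(latest()));
        Collections.addAll(next, added);
        versions.add(Collections.unmodifiableSet(next));
        return latest();
    }

    // "See into the past": read any committed version, not just the latest.
    Set<String> read(int version) {
        return versions.get(version);
    }

    // Reset: forget everything committed after 'version'.
    void resetTo(int version) {
        versions.subList(version + 1, versions.size()).clear();
    }
}
```

After a reset, the next commit continues from the chosen version, which matches the "evolve from there" behaviour described above. Branching would amount to keeping the discarded versions as well, which this sketch deliberately does not do.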