See also Stephen's jena-client work.
Any transaction needs a context - that can be an explicit transaction
object or it can be encapsulated. For the Jena API, that context is
the Dataset.
Whatever the transaction context is, one transaction context can have
only one transaction operation active at a time. TDB adds the
restriction that it's per-thread, one that might bite (cf. actors), but
the idea that a transaction has one operation active at a time is quite
common and simplifies thinking about the application. Multiple, truly
concurrent readers inside a single transaction make some sense to me,
but multiple writers in one transaction seem to me to be a sign of a
weird design risking inconsistent application data views (i.e. it needs
data-model level locking of some kind).
Instead, to have parallel activity you need multiple transaction
contexts. In JDBC, the transaction context is the connection. Maybe
Jena needs a "connection" concept although it is sort of there already
in the DatasetGraph (= DSG).
The DSG interface gets repurposed. TDB has various layers, not all of
which are formally exposed as APIs.
First digression:
In TDB, there are read transactions and write transactions, which is
why begin takes an argument. It is more common to see begin(). TDB
could do this, but then the issue of isolation becomes more prominent.
I expected some questions about why TDB's begin has a different
signature from the norm, but no one has commented (below I'll say how
it could be begin()).
TDB's isolation level is serializable - it is a natural consequence of
the design, and it does cost in needing to keep more write state around
for longer. I think the better application guarantees are worth it;
even more so because it's possible to implement a zero-copy design for
this (currently it is one-copy). Reads are repeatable and range scans
are consistent. Given that SPARQL solving is always lots of range
scans, this is quite convenient.
A minor advantage of knowing up-front that it is a read transaction is
that the transaction DSG is, in fact, nothing more than a read-only
wrapper on the state of the system at the time the transaction starts.
It is very lightweight. The wrapper just blocks updates to indexes and
the node table. And it's cached anyway.
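That read-only wrapper idea can be shown in miniature. This is a toy
sketch, not TDB's actual classes - real TDB wraps indexes and the node
table, and the "store" here is just a set of strings:

```java
import java.util.HashSet;
import java.util.Set;

// Toy sketch of the read-transaction wrapper: reads delegate straight
// to the underlying store, updates are blocked. Everything here is a
// stand-in - real TDB wraps indexes and the node table, not a Set.
public class ReadOnlyView {
    interface Store {
        void add(String triple);
        boolean contains(String triple);
    }

    static class BaseStore implements Store {
        private final Set<String> triples = new HashSet<>();
        public void add(String triple) { triples.add(triple); }
        public boolean contains(String triple) { return triples.contains(triple); }
    }

    // The read-only view is very lightweight: it holds nothing but a
    // reference to the state of the system when the transaction started.
    static class ReadWrapper implements Store {
        private final Store base;
        ReadWrapper(Store base) { this.base = base; }
        public void add(String triple) {
            throw new UnsupportedOperationException("update in a read transaction");
        }
        public boolean contains(String triple) { return base.contains(triple); }
    }

    public static void main(String[] args) {
        BaseStore db = new BaseStore();
        db.add(":s :p :o");
        Store reader = new ReadWrapper(db);
        System.out.println(reader.contains(":s :p :o")); // true - reads pass through
    }
}
```

Because the wrapper carries no state of its own, one instance can be
shared by every reader until the next write changes the database.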
The use of DSGs in TDB is, from the bottom:
DatasetGraphTDB is the storage access layer. It is built on top of
indexes, and when you first open a location, a DatasetGraphTDB is built
whose indexes are attached to the real storage.
A read transaction on a database that isn't being updated works almost
directly on the raw storage. The read wrapper is cached, so all readers
between changes made by a writer are using the same object.
To create a write transaction, the system creates transactional indexes
and builds a DatasetGraphTDB - a transactional view of the storage that
puts changes in memory, then in the journal, and only eventually gets
round to changing the main database.
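A minimal sketch of that write-view idea, with a set of strings
standing in for the database and in-memory sets standing in for the
journal (class and method names are mine, not TDB's):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a write-transaction view: changes accumulate in memory
// (standing in for TDB's journal) and only reach the main store on
// commit; reads see the base data plus this transaction's own changes.
// Class and method names are illustrative, not TDB's.
public class WriteView {
    private final Set<String> base;                       // the main database
    private final Set<String> added = new HashSet<>();    // pending adds
    private final Set<String> removed = new HashSet<>();  // pending deletes

    WriteView(Set<String> base) { this.base = base; }

    void add(String t) { removed.remove(t); added.add(t); }
    void delete(String t) { added.remove(t); removed.add(t); }

    // Reads see the base state overlaid with pending changes.
    boolean contains(String t) {
        if (removed.contains(t)) return false;
        return added.contains(t) || base.contains(t);
    }

    // Only at commit does the main database change.
    void commit() {
        base.removeAll(removed);
        base.addAll(added);
        added.clear();
        removed.clear();
    }

    public static void main(String[] args) {
        Set<String> db = new HashSet<>();
        db.add(":s :p 1");
        WriteView txn = new WriteView(db);
        txn.add(":s :p 2");
        txn.delete(":s :p 1");
        System.out.println(txn.contains(":s :p 2")); // true - the view sees its own adds
        System.out.println(db.contains(":s :p 2"));  // false - base untouched before commit
        txn.commit();
        System.out.println(db.contains(":s :p 2"));  // true - changes applied
    }
}
```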
Next there is DatasetGraphTxn. It's really only a bit of admin, tying
the transactional DatasetGraphTDB to a transaction object that tracks
the state of the transaction.
One DatasetGraphTxn is created per transaction and they are one-time
use. There is no reason these can't be used on different threads, but
there is an assumption of no concurrent access (the real requirement is
MRSW - multiple reader or single writer).
Passing this DatasetGraphTxn object around is the transaction context.
It can be passed between threads.
Jena could do:

    DatasetGraphTxn txn = connection.begin(Read/Write) ;
    try {
        ... multi-threaded stuff
        ... txn.commit() / txn.abort()
    } finally { txn.end() ; }
and even that transaction commit/abort/end can be on any thread. Adding
end() was something else I was expecting people to comment on as it's a
bit different.
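The lifecycle of such a one-time-use transaction object could be
sketched like this. Txn is a hypothetical class, and the exact rules
(e.g. end() of an uncommitted transaction counts as an abort) are my
assumptions for illustration, not a description of DatasetGraphTxn:

```java
// Hypothetical one-time-use transaction object with an explicit end():
// commit/abort/end may be called from any thread, end() is idempotent
// and safe in a finally block, and end() of an uncommitted transaction
// counts as an abort. Names and rules are assumptions, sketching the
// DatasetGraphTxn pattern rather than reproducing it.
public class Txn {
    enum State { ACTIVE, COMMITTED, ABORTED }

    private State state = State.ACTIVE;
    private boolean ended = false;

    synchronized void commit() { requireActive(); state = State.COMMITTED; }

    synchronized void abort() { requireActive(); state = State.ABORTED; }

    synchronized void end() {
        if (state == State.ACTIVE) state = State.ABORTED; // never committed
        ended = true;  // release views, locks, journal space here
    }

    private void requireActive() {
        if (ended || state != State.ACTIVE)
            throw new IllegalStateException("transaction not active");
    }

    synchronized State state() { return state; }

    public static void main(String[] args) {
        Txn t1 = new Txn();
        t1.commit();
        t1.end();
        System.out.println(t1.state());  // COMMITTED

        Txn t2 = new Txn();
        t2.end();                        // forgot to commit
        System.out.println(t2.state());  // ABORTED
    }
}
```

Making end() idempotent is what makes the try/finally shape above safe:
the finally block runs whether or not commit or abort happened.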
This is not the common paradigm; in JDBC, the transaction operations
change the state of the connection itself:

    connection.begin(Read/Write) ;
    try {
        ... connection.commit() / connection.abort()
    } finally { connection.end() ; }
As we need connection operations, the "connection" is a DSG with state
that holds the current DatasetGraphTxn. This is
DatasetGraphTransaction, which uses the calling thread to determine the
transaction context. The thread is the transaction context - so,
conveniently, the application does not need to pass it around or into
library code.
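The thread-as-context trick is essentially a ThreadLocal. A toy
version follows - ThreadContext and the plain String mode are my
inventions; DatasetGraphTransaction does this with the full transaction
machinery behind it:

```java
// Toy version of thread-as-transaction-context: a ThreadLocal maps the
// calling thread to its current transaction state, so the application
// never passes a transaction object around explicitly.
public class ThreadContext {
    private final ThreadLocal<String> current = new ThreadLocal<>();

    void begin(String mode) {
        if (current.get() != null)
            throw new IllegalStateException("transaction already active on this thread");
        current.set(mode);
    }

    String currentTxn() { return current.get(); }

    void end() { current.remove(); }

    public static void main(String[] args) throws InterruptedException {
        ThreadContext ctx = new ThreadContext();
        ctx.begin("WRITE");
        System.out.println(ctx.currentTxn());  // WRITE - this thread's context
        // Another thread sees no transaction: contexts are per-thread.
        Thread other = new Thread(() -> System.out.println(ctx.currentTxn())); // null
        other.start();
        other.join();
        ctx.end();
    }
}
```

The cost of the convenience is exactly the restriction mentioned
earlier: a transaction started this way cannot span threads.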
Autocommit:
There is no autocommit in TDB. This is driven by legacy: old code that
does not touch transactions gets old-style raw access to the storage.
Autocommit seems to have no end of problems, not least that it is going
to be very slow. Whether, long term, that's a good idea, I don't know.
Requiring explicit transaction boundaries seems rather harsh - it
pushes concepts into the main public APIs where many simple
applications simply do not care about them.
At the Jena API level, Dataset (DatasetImpl) wraps a "Transactional"
object, which is DatasetGraphTransaction if it's TDB, because
DatasetGraphTransaction implements both DatasetGraph and Transactional.
Dataset itself does not expose "Transactional", but that is only to
stop a proliferation of publicly visible interfaces. There are no
compile-time non-transactional Datasets.
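A stripped-down sketch of that wrapping arrangement - the interfaces
here are toy stand-ins for the Jena ones, just to show the delegation:

```java
// Stripped-down sketch of the wrapping at the API level: a Dataset-like
// wrapper delegates transaction calls to the wrapped graph when that
// graph happens to implement Transactional. All names are illustrative.
public class Wrapping {
    interface Transactional { void begin(); void commit(); }
    interface Graph { }

    // Plays the role of DatasetGraphTransaction: a graph that is also
    // transactional.
    static class TxnGraph implements Graph, Transactional {
        boolean inTxn = false;
        public void begin() { inTxn = true; }
        public void commit() { inTxn = false; }
    }

    // Plays the role of DatasetImpl: exposes transaction operations
    // without exposing the Transactional interface itself.
    static class Dataset {
        private final Graph graph;
        Dataset(Graph graph) { this.graph = graph; }

        void begin() {
            if (!(graph instanceof Transactional))
                throw new UnsupportedOperationException("graph is not transactional");
            ((Transactional) graph).begin();
        }

        void commit() {
            if (graph instanceof Transactional)
                ((Transactional) graph).commit();
        }
    }

    public static void main(String[] args) {
        TxnGraph g = new TxnGraph();
        Dataset ds = new Dataset(g);
        ds.begin();
        System.out.println(g.inTxn);   // true - delegated to the wrapped graph
        ds.commit();
        System.out.println(g.inTxn);   // false
    }
}
```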
begin() vs begin(mode):
begin() can be done, although there is a bit of a catch whichever way
you do it.
The naive way: treat all transactions as writers. But currently only
one writer can be active (there can be readers at the same time). So
instead, run in "read" mode and flip to "write" mode if any update is
done.
Assume that any transactional DSG has the capability to do updates -
the read-only optimizations of direct access where possible, and
caching of reusable views, aren't there. All transactions are
potential writers, but the capability is latent.
We trap any update and flip the state of the transaction to "write".
Now suppose two transactions are active, both reading ... then one,
and later the other, goes to write mode. We have the potential for a
conflict - two transactions want to make changes based on the same
state of the database. That can't happen in TDB at the moment: one
writer at a time, and the writer starts with a view of the database
that no one else will be able to change.
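One non-locking way to handle the flip is to refuse promotion when the
transaction's starting snapshot has gone stale. A sketch - this is my
illustration of the problem, not what TDB implements:

```java
// Sketch of read-to-write promotion and why it can conflict: each
// transaction starts against a snapshot version of the database, and
// promotion to writer is refused if another writer is active or has
// committed since this transaction began (its snapshot is stale).
// This illustrates the problem; it is not TDB's implementation.
public class Promotion {
    private long version = 0;          // bumped by each committing writer
    private boolean writerActive = false;

    class Txn {
        final long startVersion = version;  // snapshot seen by this txn
        private boolean writer = false;

        boolean promote() {
            synchronized (Promotion.this) {
                if (writerActive) return false;            // one writer at a time
                if (version != startVersion) return false; // snapshot is stale
                writerActive = true;
                writer = true;
                return true;
            }
        }

        void commit() {
            synchronized (Promotion.this) {
                if (writer) { version++; writerActive = false; }
            }
        }
    }

    Txn begin() { return new Txn(); }

    public static void main(String[] args) {
        Promotion db = new Promotion();
        Promotion.Txn a = db.begin();
        Promotion.Txn b = db.begin();
        System.out.println(a.promote()); // true - a becomes the single writer
        System.out.println(b.promote()); // false - a already holds the write slot
        a.commit();
        System.out.println(b.promote()); // false - b's snapshot predates a's commit
    }
}
```

The second refusal is the interesting one: b did nothing wrong, but any
write it made would be based on a state of the database that no longer
exists, so the only safe answers are "refuse" or "abort and retry".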
Traditional transactional systems handle this with locking. With lock
promotion going on, there is the risk of incompatible requests and of
the system having to abort one or other transaction. I didn't like
that - I preferred a clean, no-spurious-aborts point of view.
SS2PL (a two-phase locking approach) could be used, but it is
complicated. A recurrent problem with locking in RDF is that
triple-level locking is going to be expensive, probably very expensive
(it's row-level locking on very small rows). Locking on a block leads
to weird effects because a block does not contain an
application-understood chunk of data, unlike a database where a (large)
row or an application table is some logical concept. And in TDB there
are two tables - table locking is too coarse. Graph locking is
possible but would suggest per-graph indexes - a big change, and one
that could cause problems for default union graph queries.
OK - that's a long enough ramble for now ... I hope the insides of TDB
are a bit clearer, and how they relate to the API contract.
Andy