See also Stephen's jena-client work.

Any transaction needs a context - that can be an explicit transaction object or it can be encapsulated. For the Jena API, that "something" is the Dataset.

Whatever the transaction context is, one transaction context can have only one transaction operation at a time. It is a restriction in TDB that it's per-thread, and one that might bite (cf. actors), but the idea that a transaction has one operation active at a time is quite common and simplifies thinking about the application. Multiple, truly concurrent readers inside a single transaction make some sense to me, but multiple writers in one transaction seem to me to be a sign of a weird design, risking inconsistent application data views (i.e. it needs data-model-level locking of some kind added).

Instead, to have parallel activity you need multiple transaction contexts. In JDBC, the transaction context is the connection. Maybe Jena needs a "connection" concept although it is sort of there already in the DatasetGraph (= DSG).
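The rule above - one active transaction per context, with parallelism coming from multiple contexts - can be sketched in plain Java. This is a hypothetical `Connection` class for illustration only, not Jena's actual API:

```java
// Sketch: a hypothetical "connection" that enforces one active
// transaction per context. Parallel activity needs multiple connections.
final class Connection {
    private boolean inTransaction = false;

    synchronized void begin() {
        if (inTransaction)
            throw new IllegalStateException("Transaction already active on this connection");
        inTransaction = true;
    }

    synchronized void end() {
        inTransaction = false;
    }
}

public class OneTxnPerContext {
    public static void main(String[] args) {
        Connection c1 = new Connection();
        c1.begin();
        try {
            c1.begin();                     // second begin on the same context: rejected
        } catch (IllegalStateException e) {
            System.out.println("second begin rejected");
        }

        Connection c2 = new Connection();   // a second context runs independently
        c2.begin();
        System.out.println("parallel contexts ok");
        c1.end();
        c2.end();
    }
}
```

The point is only that the "one at a time" rule attaches to the context object, not to the system as a whole.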

The DSG interface gets repurposed. TDB has various layers, not all of which are formally exposed as APIs.

First digression:

In TDB, there are read transactions and write transactions, which is why begin takes an argument. It is more common to see begin(). TDB could do this but then the issue of isolation becomes more prominent.

I expected some questions about why TDB's begin has a different method signature than usual, but no one has commented (below I'll say how it could be begin()).

TDB isolation level is serializable - it is a natural consequence of the design and it does cost in needing to keep more write state about for longer. I think the better application guarantees are worth it; even more, it's possible to implement a zero-copy design for this (currently it is one-copy). Reads are repeatable and ranges are consistent. Given that SPARQL solving is always lots of range scans, this is quite convenient.

A minor advantage of knowing up-front it is a read transaction is that the transaction DSG is, in fact, nothing more than a read-only wrapper on the state of the system at the time the transaction starts. It is very light-weight. The wrapper just blocks updates to indexes and the node table. And it's cached anyway.
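As a sketch of that idea, a read-only wrapper can be little more than an unmodifiable view over the live state. These are stand-in classes, not TDB's actual ones - the real wrapper sits over the indexes and the node table:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: a read transaction as a read-only view of the current state.
// A Set of strings stands in for the storage; the wrapper blocks updates
// while reads pass straight through to the underlying data.
public class ReadView {
    static Set<String> readOnlyView(Set<String> storage) {
        return Collections.unmodifiableSet(storage);
    }

    public static void main(String[] args) {
        Set<String> storage = new HashSet<>();
        storage.add("triple1");

        Set<String> reader = readOnlyView(storage);
        System.out.println(reader.contains("triple1"));   // reads work

        try {
            reader.add("triple2");                        // updates are blocked
        } catch (UnsupportedOperationException e) {
            System.out.println("update blocked");
        }
    }
}
```

Because the wrapper holds no state of its own, it is cheap to create and safe to cache and share between readers.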

The use of DSGs in TDB is, from the bottom:

DatasetGraphTDB is the storage access. It is built on top of indexes and when you first open a location, a DatasetGraphTDB is built whose indexes are attached to the real storage.

A read transaction on a database that isn't being updated works almost directly on the raw storage. The read wrapper is cached, so all readers between changes made by a writer are using the same object.

To create a write transaction, the system creates transactional indexes and builds a DatasetGraphTDB - a transaction view of the storage that puts changes in-memory, then in the journal, and only eventually gets round to changing the main database.

Next there is DatasetGraphTxn. It's really only a bit of admin, tying the transactional DatasetGraphTDB to a transaction object that tracks the state of the transaction.

One DatasetGraphTxn is created per transaction and they are one-time use. There is no reason these can't be used on different threads, but there is an assumption of no concurrent access (the real requirement is MRSW).

Passing this DatasetGraphTxn object around is the transaction context. It can be passed between threads.

Jena could do:
        
DatasetGraphTxn txn = connection.begin(Read/Write) ;
try {
    ... multi-threaded stuff ...
    txn.commit() ;   // or txn.abort()
} finally {
    txn.end() ;
}

and that transaction commit/abort/end can even be on any thread. Adding end() was something else I was expecting people to comment on, as it's a bit different.

This is not the common paradigm for JDBC, which is to change the state of the JDBC connection.

connection.begin(Read/Write) ;
try {
    ...
    connection.commit() ;   // or connection.abort()
} finally {
    connection.end() ;
}

As we need connection operations, the "connection" is a DSG with state that holds the current DatasetGraphTxn. This is DatasetGraphTransaction, which uses the calling thread to determine the transaction context. The thread is the transaction context - so, conveniently, the application does not need to pass it around or into library code.
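The thread-as-context idea can be sketched with a `ThreadLocal`. This is a toy stand-in (a string for the transaction mode) - the real class is DatasetGraphTransaction, which holds the current DatasetGraphTxn per thread:

```java
// Sketch: the calling thread determines the transaction context, so the
// application never has to pass a transaction object around explicitly.
public class ThreadContext {
    // One current transaction per thread, looked up implicitly.
    private static final ThreadLocal<String> currentTxn = new ThreadLocal<>();

    static void begin(String mode) {
        if (currentTxn.get() != null)
            throw new IllegalStateException("Already in a transaction on this thread");
        currentTxn.set(mode);
    }

    static String current() { return currentTxn.get(); }

    static void end() { currentTxn.remove(); }

    public static void main(String[] args) throws InterruptedException {
        begin("READ");
        System.out.println("main thread: " + current());

        // A different thread has its own, independent context.
        Thread t = new Thread(() -> {
            begin("WRITE");
            System.out.println("worker thread: " + current());
            end();
        });
        t.start();
        t.join();
        end();
    }
}
```

The trade-off is the per-thread restriction mentioned earlier: the implicit lookup is convenient, but a transaction tied to a thread can't be handed off the way a DatasetGraphTxn object can.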

Autocommit:

There is no autocommit in TDB. This is driven by legacy: old code that does not touch transactions gets old-style raw access to the storage. Autocommit seems to have no end of problems, not least because it is going to be very slow. Whether, long term, that's a good idea, I don't know. Explicit transaction boundaries only seems rather harsh - pushing concepts into the main public APIs where many simple applications simply do not care about them.

At the Jena API, Dataset (DatasetImpl) wraps a "Transactional" object, which is DatasetGraphTransaction if it's TDB, because DatasetGraphTransaction implements both DatasetGraph and Transactional. Dataset itself does not expose "Transactional", but that is only to stop a proliferation of publicly visible interfaces. There are no compile-time non-transactional Datasets.

begin() vs begin(mode):

begin() can be done, although there is a bit of a catch whichever way you do it.

Naive way: all transactions are writers. But currently only one writer can be active (there can be readers at the same time). So instead, run in "read" mode and flip to write mode if any update is done.

Assume that any transactional DSG has the capability to do updates - the read-only optimizations of direct access where possible, and the caching of reusable views, aren't there. All transactions are writers but the capability is latent.

We trap any update and flip the state of the transaction to "write".

Now suppose that two transactions are active, reading ... one then the other goes to write mode. We have the potential for a conflict - two transactions want to make changes based on the same state of the database. That can't happen in TDB at the moment. One writer at a time and the writer starts with a view of the database that no one else will be able to change.
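That promotion conflict can be sketched as follows. These are hypothetical classes for illustration - as the text says, TDB avoids the situation entirely by declaring writers up front at begin:

```java
// Sketch: a transaction starts as a reader and is promoted to writer on
// its first update. With only one active writer allowed, promotion must
// fail if another transaction has already been promoted.
public class Promotion {
    enum Mode { READ, WRITE }

    static final Object lock = new Object();
    static boolean writerActive = false;

    static final class Txn {
        Mode mode = Mode.READ;

        void update(String quad) {
            if (mode == Mode.READ)
                promote();                 // flip to write mode on first update
            // ... apply the change ...
        }

        private void promote() {
            synchronized (lock) {
                if (writerActive)
                    // Two readers wanting to write from the same base state:
                    // the conflict that would force an abort.
                    throw new IllegalStateException("Another writer is active");
                writerActive = true;
                mode = Mode.WRITE;
            }
        }
    }

    public static void main(String[] args) {
        Txn a = new Txn();
        Txn b = new Txn();
        a.update("q1");                    // a promotes: READ -> WRITE
        System.out.println("a mode: " + a.mode);
        try {
            b.update("q2");                // b cannot promote while a writes
        } catch (IllegalStateException e) {
            System.out.println("b promotion refused");
        }
    }
}
```

Refusing the second promotion is exactly the "spurious abort" problem: transaction b did nothing wrong, yet it cannot proceed.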

Traditional transactional systems handle this with locking. There is lock promotion going on, so there is a risk of incompatible requests, with the system having to abort one or the other transaction. I didn't like that; I preferred a clean, no-spurious-aborts point of view.

SS2PL (a two-phase locking approach) could be used, but it is complicated. A recurrent problem in RDF with locking is that triple-level locking is going to be expensive, probably very expensive (it's row-level locking on very small rows). Locking on a block leads to weird effects, because a block does not contain an application-understood chunk of data, unlike a database where a (large) row or an application table is some logical concept. And in TDB there are two tables - table locking is too coarse.

Graph locking is possible but would suggest per-graph indexes - a big change, and one that could cause problems for default union graph query.

OK - that's a long enough ramble for now ... I hope the insides of TDB are a bit clearer, and how they relate to the API contract.

        Andy
