See also Stephen's jena-client work.
Any transaction needs a context - that can be an explicit transaction
object or it can be encapsulated. For the Jena API, that context is
the Dataset.
Whatever the transaction context is, one transaction context can have
only one transaction operation active at a time. TDB adds the
restriction that it's per-thread, one that might bite (cf. actors), but
the idea that a transaction has one operation active at a time is quite
common and simplifies thinking about the application. Multiple, truly
concurrent readers inside a single transaction make some sense to me,
but multiple writers in one transaction seem to me to be a sign of a
weird design risking inconsistent application data views (i.e. it needs
data-model level locking of some kind).
Instead, to have parallel activity you need multiple transaction
contexts. In JDBC, the transaction context is the connection. Maybe
Jena needs a "connection" concept although it is sort of there already
in the DatasetGraph (= DSG).
The DSG interface gets repurposed. TDB has various layers, not all of
which are formally exposed as APIs.
First digression:
In TDB, there are read transactions and write transactions, which is
why begin takes an argument. It is more common to see begin(). TDB
could do this, but then the issue of isolation becomes more prominent.
I expected some questions about why TDB's begin has a different
signature from the norm, but no one has commented (below I'll say how
it could be begin()).
TDB's isolation level is serializable - it is a natural consequence of
the design, and it does cost in needing to keep more write state around
for longer. I think the better application guarantees are worth it;
even more so because it's possible to implement a zero-copy design for
this (currently it is one-copy). Reads are repeatable and range scans
are consistent. Given that SPARQL solving is always lots of range
scans, this is quite convenient.
A minor advantage of knowing up-front that it is a read transaction is
that the transaction DSG is, in fact, nothing more than a read-only
wrapper on the state of the system at the time the transaction starts.
It is very lightweight. The wrapper just blocks updates to indexes and
the node table. And it's cached anyway.
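That read-only wrapper idea can be shown in miniature. This is a toy
sketch, not TDB's actual classes - real TDB wraps indexes and the node
table, and the "store" here is just a set of strings:

```java
import java.util.HashSet;
import java.util.Set;

// Toy sketch of the read-transaction wrapper: reads delegate straight
// to the underlying store, updates are blocked. Everything here is a
// stand-in - real TDB wraps indexes and the node table, not a Set.
public class ReadOnlyView {
    interface Store {
        void add(String triple);
        boolean contains(String triple);
    }

    static class BaseStore implements Store {
        private final Set<String> triples = new HashSet<>();
        public void add(String triple) { triples.add(triple); }
        public boolean contains(String triple) { return triples.contains(triple); }
    }

    // The read-only view is very lightweight: it holds nothing but a
    // reference to the state of the system when the transaction started.
    static class ReadWrapper implements Store {
        private final Store base;
        ReadWrapper(Store base) { this.base = base; }
        public void add(String triple) {
            throw new UnsupportedOperationException("update in a read transaction");
        }
        public boolean contains(String triple) { return base.contains(triple); }
    }

    public static void main(String[] args) {
        BaseStore db = new BaseStore();
        db.add(":s :p :o");
        Store reader = new ReadWrapper(db);
        System.out.println(reader.contains(":s :p :o")); // true - reads pass through
    }
}
```

Because the wrapper carries no state of its own, one instance can be
shared by every reader until the next write changes the database.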
The use of DSGs in TDB is, from the bottom:
DatasetGraphTDB is the storage access layer. It is built on top of
indexes, and when you first open a location, a DatasetGraphTDB is built
whose indexes are attached to the real storage.
A read transaction on a database that isn't being updated works almost
directly on the raw storage. The read wrapper is cached, so all readers
between changes made by a writer are using the same object.
To create a write transaction, the system creates transactional indexes
and builds a DatasetGraphTDB - a transactional view of the storage that
puts changes in memory, then in the journal, and only eventually gets
round to changing the main database.
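A minimal sketch of that write-view idea, with a set of strings
standing in for the database and in-memory sets standing in for the
journal (class and method names are mine, not TDB's):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a write-transaction view: changes accumulate in memory
// (standing in for TDB's journal) and only reach the main store on
// commit; reads see the base data plus this transaction's own changes.
// Class and method names are illustrative, not TDB's.
public class WriteView {
    private final Set<String> base;                       // the main database
    private final Set<String> added = new HashSet<>();    // pending adds
    private final Set<String> removed = new HashSet<>();  // pending deletes

    WriteView(Set<String> base) { this.base = base; }

    void add(String t) { removed.remove(t); added.add(t); }
    void delete(String t) { added.remove(t); removed.add(t); }

    // Reads see the base state overlaid with pending changes.
    boolean contains(String t) {
        if (removed.contains(t)) return false;
        return added.contains(t) || base.contains(t);
    }

    // Only at commit does the main database change.
    void commit() {
        base.removeAll(removed);
        base.addAll(added);
        added.clear();
        removed.clear();
    }

    public static void main(String[] args) {
        Set<String> db = new HashSet<>();
        db.add(":s :p 1");
        WriteView txn = new WriteView(db);
        txn.add(":s :p 2");
        txn.delete(":s :p 1");
        System.out.println(txn.contains(":s :p 2")); // true - the view sees its own adds
        System.out.println(db.contains(":s :p 2"));  // false - base untouched before commit
        txn.commit();
        System.out.println(db.contains(":s :p 2"));  // true - changes applied
    }
}
```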
Next there is DatasetGraphTxn. It's really only a bit of admin, tying
the transactional DatasetGraphTDB to a transaction object that tracks
the state of the transaction.
One DatasetGraphTxn is created per transaction and they are one-time
use. There is no reason these can't be used on different threads, but
there is an assumption of no concurrent access (the real requirement is
MRSW - multiple reader or single writer).
Passing this DatasetGraphTxn object around is the transaction context.
It can be passed between threads.
Jena could do:

    DatasetGraphTxn txn = connection.begin(Read/Write) ;
    try {
        ... multi-threaded stuff
        ... txn.commit() / txn.abort()
    } finally { txn.end() ; }
and even that transaction commit/abort/end can be on any thread. Adding
end() was something else I was expecting people to comment on as it's a
bit different.
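The lifecycle of such a one-time-use transaction object could be
sketched like this. Txn is a hypothetical class, and the exact rules
(e.g. end() of an uncommitted transaction counts as an abort) are my
assumptions for illustration, not a description of DatasetGraphTxn:

```java
// Hypothetical one-time-use transaction object with an explicit end():
// commit/abort/end may be called from any thread, end() is idempotent
// and safe in a finally block, and end() of an uncommitted transaction
// counts as an abort. Names and rules are assumptions, sketching the
// DatasetGraphTxn pattern rather than reproducing it.
public class Txn {
    enum State { ACTIVE, COMMITTED, ABORTED }

    private State state = State.ACTIVE;
    private boolean ended = false;

    synchronized void commit() { requireActive(); state = State.COMMITTED; }

    synchronized void abort() { requireActive(); state = State.ABORTED; }

    synchronized void end() {
        if (state == State.ACTIVE) state = State.ABORTED; // never committed
        ended = true;  // release views, locks, journal space here
    }

    private void requireActive() {
        if (ended || state != State.ACTIVE)
            throw new IllegalStateException("transaction not active");
    }

    synchronized State state() { return state; }

    public static void main(String[] args) {
        Txn t1 = new Txn();
        t1.commit();
        t1.end();
        System.out.println(t1.state());  // COMMITTED

        Txn t2 = new Txn();
        t2.end();                        // forgot to commit
        System.out.println(t2.state());  // ABORTED
    }
}
```

Making end() idempotent is what makes the try/finally shape above safe:
the finally block runs whether or not commit or abort happened.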
This is not the common paradigm; in JDBC, the transaction operations
change the state of the connection itself:

    connection.begin(Read/Write) ;
    try {
        ... connection.commit() / connection.abort()
    } finally { connection.end() ; }
As we need connection operations, the "connection" is a DSG with state
that holds the current DatasetGraphTxn. This is
DatasetGraphTransaction, which uses the calling thread to determine the
transaction context. The thread is the transaction context - so,
conveniently, the application does not need to pass it around or into
library code.
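The thread-as-context trick is essentially a ThreadLocal. A toy
version follows - ThreadContext and the plain String mode are my
inventions; DatasetGraphTransaction does this with the full transaction
machinery behind it:

```java
// Toy version of thread-as-transaction-context: a ThreadLocal maps the
// calling thread to its current transaction state, so the application
// never passes a transaction object around explicitly.
public class ThreadContext {
    private final ThreadLocal<String> current = new ThreadLocal<>();

    void begin(String mode) {
        if (current.get() != null)
            throw new IllegalStateException("transaction already active on this thread");
        current.set(mode);
    }

    String currentTxn() { return current.get(); }

    void end() { current.remove(); }

    public static void main(String[] args) throws InterruptedException {
        ThreadContext ctx = new ThreadContext();
        ctx.begin("WRITE");
        System.out.println(ctx.currentTxn());  // WRITE - this thread's context
        // Another thread sees no transaction: contexts are per-thread.
        Thread other = new Thread(() -> System.out.println(ctx.currentTxn())); // null
        other.start();
        other.join();
        ctx.end();
    }
}
```

The cost of the convenience is exactly the restriction mentioned
earlier: a transaction started this way cannot span threads.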
Autocommit:
There is no autocommit in TDB. This is driven by legacy: old code that
does not touch transactions gets old-style raw access to the storage.
Autocommit seems to have no end of problems, not least that it is going
to be very slow. Whether, long term, that's a good idea, I don't know.
Requiring explicit transaction boundaries seems rather harsh - it
pushes concepts into the main public APIs where many simple
applications simply do not care about them.
At the Jena API level, Dataset (DatasetImpl) wraps a "Transactional"
object, which is DatasetGraphTransaction if it's TDB, because
DatasetGraphTransaction implements both DatasetGraph and Transactional.
Dataset itself does not expose "Transactional", but that is only to
stop a proliferation of publicly visible interfaces. There are no
compile-time non-transactional Datasets.
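A stripped-down sketch of that wrapping arrangement - the interfaces
here are toy stand-ins for the Jena ones, just to show the delegation:

```java
// Stripped-down sketch of the wrapping at the API level: a Dataset-like
// wrapper delegates transaction calls to the wrapped graph when that
// graph happens to implement Transactional. All names are illustrative.
public class Wrapping {
    interface Transactional { void begin(); void commit(); }
    interface Graph { }

    // Plays the role of DatasetGraphTransaction: a graph that is also
    // transactional.
    static class TxnGraph implements Graph, Transactional {
        boolean inTxn = false;
        public void begin() { inTxn = true; }
        public void commit() { inTxn = false; }
    }

    // Plays the role of DatasetImpl: exposes transaction operations
    // without exposing the Transactional interface itself.
    static class Dataset {
        private final Graph graph;
        Dataset(Graph graph) { this.graph = graph; }

        void begin() {
            if (!(graph instanceof Transactional))
                throw new UnsupportedOperationException("graph is not transactional");
            ((Transactional) graph).begin();
        }

        void commit() {
            if (graph instanceof Transactional)
                ((Transactional) graph).commit();
        }
    }

    public static void main(String[] args) {
        TxnGraph g = new TxnGraph();
        Dataset ds = new Dataset(g);
        ds.begin();
        System.out.println(g.inTxn);   // true - delegated to the wrapped graph
        ds.commit();
        System.out.println(g.inTxn);   // false
    }
}
```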
begin() vs begin(mode):
begin() can be done, although there is a bit of a catch whichever way
you do it.
The naive way: treat all transactions as writers. But currently only
one writer can be active (there can be readers at the same time). So
instead, run in "read" mode and flip to "write" mode if any update is
done.
Assume that any transactional DSG has the capability to do updates -
the read-only optimizations of direct access where possible, and
caching of reusable views, aren't there. All transactions are
potential writers, but the capability is latent.
We trap any update and flip the state of the transaction to "write".
Now suppose two transactions are active, both reading ... then one,
and later the other, goes to write mode. We have the potential for a
conflict - two transactions want to make changes based on the same
state of the database. That can't happen in TDB at the moment: one
writer at a time, and the writer starts with a view of the database
that no one else will be able to change.
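One non-locking way to handle the flip is to refuse promotion when the
transaction's starting snapshot has gone stale. A sketch - this is my
illustration of the problem, not what TDB implements:

```java
// Sketch of read-to-write promotion and why it can conflict: each
// transaction starts against a snapshot version of the database, and
// promotion to writer is refused if another writer is active or has
// committed since this transaction began (its snapshot is stale).
// This illustrates the problem; it is not TDB's implementation.
public class Promotion {
    private long version = 0;          // bumped by each committing writer
    private boolean writerActive = false;

    class Txn {
        final long startVersion = version;  // snapshot seen by this txn
        private boolean writer = false;

        boolean promote() {
            synchronized (Promotion.this) {
                if (writerActive) return false;            // one writer at a time
                if (version != startVersion) return false; // snapshot is stale
                writerActive = true;
                writer = true;
                return true;
            }
        }

        void commit() {
            synchronized (Promotion.this) {
                if (writer) { version++; writerActive = false; }
            }
        }
    }

    Txn begin() { return new Txn(); }

    public static void main(String[] args) {
        Promotion db = new Promotion();
        Promotion.Txn a = db.begin();
        Promotion.Txn b = db.begin();
        System.out.println(a.promote()); // true - a becomes the single writer
        System.out.println(b.promote()); // false - a already holds the write slot
        a.commit();
        System.out.println(b.promote()); // false - b's snapshot predates a's commit
    }
}
```

The second refusal is the interesting one: b did nothing wrong, but any
write it made would be based on a state of the database that no longer
exists, so the only safe answers are "refuse" or "abort and retry".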
Traditional transactional systems handle this with locking. With lock
promotion going on, there is the risk of incompatible requests and of
the system having to abort one or other transaction. I didn't like
that - I preferred a clean, no-spurious-aborts point of view.
SS2PL (a two-phase locking approach) could be used, but it is
complicated. A recurrent problem with locking in RDF is that
triple-level locking is going to be expensive, probably very expensive
(it's row-level locking on very small rows). Locking on a block leads
to weird effects because a block does not contain an
application-understood chunk of data, unlike a database where a (large)
row or an application table is some logical concept. And in TDB there
are two tables - table locking is too coarse. Graph locking is
possible but would suggest per-graph indexes - a big change, and one
that could cause problems for default union graph queries.
OK - that's a long enough ramble for now ... I hope the insides of TDB
are a bit clearer, and how they relate to the API contract.
Andy