Hi Andy,

This is a rather long email I'm afraid, but I've split it into two parts:
Parliament comments and TxTDB comments.


Parliament
----------

The unit of MVCC I'm looking at is per quad.  This is similar to PostgreSQL's 
row-level MVCC.  It also means that the WAL is used for durability guarantees 
and transaction rollback rather than concurrency control.  The WAL will track 
statement insertions/deletions and disk page states on the first modification 
after a checkpoint.
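
To make that slightly more concrete, here is roughly the shape of WAL record I
have in mind (the class, field, and enum names below are placeholders of my
own, not an actual format):

// Sketch only: the kinds of record I expect the WAL to carry.
final class WalRecord {
    enum Kind {
        STMT_INSERT,        // logical redo: a statement was created
        STMT_DELETE,        // logical redo: a statement was expired
        PAGE_BEFORE_IMAGE,  // physical undo: page state at first touch after a checkpoint
        COMMIT,
        ABORT
    }

    final Kind kind;
    final long txnId;
    final long statementId;    // used by STMT_INSERT / STMT_DELETE
    final long pageId;         // used by PAGE_BEFORE_IMAGE
    final byte[] beforeImage;  // page contents prior to the first modification

    WalRecord(Kind kind, long txnId, long statementId, long pageId, byte[] beforeImage) {
        this.kind = kind;
        this.txnId = txnId;
        this.statementId = statementId;
        this.pageId = pageId;
        this.beforeImage = beforeImage;
    }
}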

Each statement in Parliament will have two transaction IDs: a creation ID that 
identifies the transaction that created the statement, and an expiration ID 
that identifies the transaction that expired (deleted) the statement.  As 
statements are immutable (they can only be created or deleted), we can expire a
statement by atomically updating the expiration ID field, avoiding locking.
When a statement is expired, Parliament retains it until all
transactions that were in process at the time of expiration have completed 
(after which time the statement is allowed to be removed by the background 
vacuum task).
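
As a sketch of what I mean (the class and field names are purely illustrative,
not actual Parliament code), the per-statement metadata and the lock-free
expiration might look something like this in Java:

import java.util.concurrent.atomic.AtomicLong;

// Sketch only: per-statement version metadata.  NONE means "no expiration
// ID recorded yet".
final class StatementVersion {
    static final long NONE = 0L;

    final long createdTxn;  // transaction that created the statement
    private final AtomicLong expiredTxn = new AtomicLong(NONE);

    StatementVersion(long createdTxn) {
        this.createdTxn = createdTxn;
    }

    // Expire without a lock: a single compare-and-set on the expiration ID.
    // Returns false if some other transaction already expired the statement.
    boolean expire(long deletingTxn) {
        return expiredTxn.compareAndSet(NONE, deletingTxn);
    }

    long expiredTxnId() {
        return expiredTxn.get();
    }
}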

Read iterators can determine if any particular statement is visible by 
recording two things at the start of the transaction: 1) the current 
transaction ID (counter value), and 2) all in-process transaction IDs.  When a
transactional query is issued, Parliament returns all statements that satisfy
the following criteria (a sketch of this check follows the list):
  - The statement’s creation ID is a committed transaction and is less than the 
current transaction counter
  - The statement lacks an expiration ID, its expiration ID was in process at 
query start, or the expiration ID is greater than the current transaction 
counter
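
A minimal sketch of that visibility test, assuming the snapshot captured the
transaction counter value and the in-process set at transaction start, and
that a set of committed transaction IDs is available (again, illustrative
names only):

import java.util.Set;

// Sketch only: visibility check applied by a read iterator.  NONE means the
// statement has no expiration ID.
final class Visibility {
    static final long NONE = 0L;

    static boolean isVisible(long createdTxn, long expiredTxn,
                             long snapshotCounter, Set<Long> inProcessAtStart,
                             Set<Long> committed) {
        // Criterion 1: the creating transaction committed and is below our
        // transaction counter value.
        if (!committed.contains(createdTxn) || createdTxn >= snapshotCounter)
            return false;
        // Criterion 2: no expiration ID, the expiring transaction was still in
        // process at query start, or it is above our transaction counter value.
        return expiredTxn == NONE
                || inProcessAtStart.contains(expiredTxn)
                || expiredTxn > snapshotCounter;
    }
}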

One helpful result of structuring statement iteration this way is that we can
align our physical data storage with the abstract RDF notion that a resource
exists only if some statement uses it.  If we ensure that the Statement List is
the sole canonical source for determining whether a resource exists, then we
are free to loosen the isolation level of the resource dictionary to
UNCOMMITTED_READ while retaining overall SERIALIZABLE isolation for the triple
store itself (thus there is no requirement for MVCC on the resource dictionary,
and its locking requirements are minimal).

Because there is no coordination between read and write transactions, it is not 
easy to know exactly when a particular expired statement can safely be deleted 
from the statement list.  Borrowing from other MVCC implementations (PostgreSQL 
in particular), Parliament will use a VACUUM task that will iterate through all 
statements and remove any links to statements that are considered expired by 
all open transactions (possibly placing the address in a free-space list for 
reuse).
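
A very rough sketch of that pass (every type and method name here is a
hypothetical placeholder, not existing Parliament code):

import java.util.Iterator;
import java.util.List;

// Sketch only: reclaim statements whose expiration ID is older than the
// oldest transaction that is still open.
final class Vacuum {
    interface StatementSlot { long expiredTxnId(); long address(); }
    interface FreeSpaceList { void release(long address); }

    static void run(List<StatementSlot> statementList, long oldestOpenTxn,
                    FreeSpaceList freeList) {
        Iterator<StatementSlot> it = statementList.iterator();
        while (it.hasNext()) {
            StatementSlot slot = it.next();
            long exp = slot.expiredTxnId();
            if (exp != 0L && exp < oldestOpenTxn) {
                it.remove();                      // unlink from the statement list
                freeList.release(slot.address()); // slot address becomes reusable
            }
        }
    }
}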

Right now I am going to focus on an implementation with only a single writer,
but with the idea that multiple writers will be possible in the future.  It is
that aspect that makes me suspect I will want more isolation levels than
SERIALIZABLE, since lower levels could reduce potentially expensive locking.
Lower levels could also be useful for improving the throughput of many short
read queries.  UNCOMMITTED_READ, for example, could avoid the locking needed to
get a list of all in-process transactions and instead just atomically get and
increment the current transaction counter.
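
Just to illustrate that contrast (not real Parliament code, and the names are
invented), starting a transaction under the two levels might differ like this:

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: SERIALIZABLE needs a consistent copy of the in-process list,
// UNCOMMITTED_READ can get away with just the counter.
final class TxnTable {
    private final AtomicLong counter = new AtomicLong();
    private final Set<Long> inProcess = new HashSet<Long>();

    synchronized long beginSerializable(Set<Long> inProcessSnapshot) {
        long id = counter.getAndIncrement();
        inProcessSnapshot.addAll(inProcess);  // copied under the table lock
        inProcess.add(id);
        return id;
    }

    long beginUncommittedRead() {
        return counter.getAndIncrement();     // no lock, no in-process copy
    }
}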

Some of these isolation levels may only make sense if the transaction interface
can mark the start and end of a query inside a transaction, since at the
DatasetGraph level all you get are .find(Quad) operations.  The example I gave
on the JIRA would need some way of indicating query start inside a transaction.
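
Purely as a hypothetical sketch of the kind of hook I mean (these are not the
real Jena interfaces, and Quad below is just a stand-in for the Jena Quad
class):

import java.util.Iterator;

// Sketch only: a DatasetGraph-like interface with explicit query boundaries
// inside a transaction.
interface TransactionalDatasetGraph {
    interface Quad { }  // placeholder for the Jena Quad class

    enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE }

    void transactionBegin(IsolationLevel level);
    void queryBegin();                  // marks the start of one query within the transaction
    Iterator<Quad> find(Quad pattern);  // existing-style lookup
    void queryEnd();                    // marks the end of that query
    void transactionCommit();
}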


TxTDB
-----

It sounds like you may be describing shadow paging [1].  The basic idea in
shadow paging is that any reader or writer gets a view onto the data based on
reachability from the pointers in a particular root block.  Pages that are
reachable from any live root block are never modified in place.  A vacuum
process is required to reclaim the space of blocks that are no longer reachable
(the last reader out could do it too).  Updates to indexes must be treated in
roughly the same way as data pages, because index pages contain pointers to the
data pages.  If you wrote each block to disk at commit time, the system would
belong to the class of logless transaction systems (although a small log might
be needed to prevent partial updates to the root page pointers).  With a shadow
paging system you also would not have to copy the new pages back over the top
of the old ones.
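
A toy sketch of the root-swap idea (nothing TDB-specific, and the names are
mine): a writer builds new pages off to the side and commits by publishing a
new root, while readers holding the old root keep seeing the old pages.

import java.util.concurrent.atomic.AtomicReference;

// Sketch only: shadow paging reduced to "readers pin a root, writers swap it".
final class ShadowPagedStore {
    static final class Root {
        final long version;
        final long rootPageId;  // all live data is reachable from this page
        Root(long version, long rootPageId) {
            this.version = version;
            this.rootPageId = rootPageId;
        }
    }

    private final AtomicReference<Root> currentRoot =
            new AtomicReference<Root>(new Root(0, 0));

    Root beginRead() {
        // Readers take whatever root is current; pages reachable from it are
        // never modified in place.
        return currentRoot.get();
    }

    void commit(long newRootPageId) {
        // New/copied pages are assumed durable by now; the commit itself is the
        // (small, possibly logged) swap of the root pointer.
        Root old = currentRoot.get();
        currentRoot.set(new Root(old.version + 1, newRootPageId));
    }
}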


-Stephen

[1] http://en.wikipedia.org/wiki/Shadow_paging


-----Original Message-----
From: Andy Seaborne [mailto:[email protected]] 
Sent: Tuesday, March 22, 2011 11:06 AM
To: [email protected]
Subject: Different policy for concurrency access in TDB supporting a single 
writer and multiple readers

Switching to email for general discussion:

Stephen,

Could you say something about your design ideas for transactions and 
concurrency in Parliament?  And specifically, what is the unit of MVCC?

I've mainly been thinking about the design for TDB, and a design that 
uses blocks as the unit of transaction and MVCC, with each writer seeing 
a view of the block space that accords with its initial time point.

Using a WAL as a redo log, the changes are not in the main database 
until the WAL is checkpointed.

If you have the time sequence:

R1: start reader 1
W1: start writer 1
W1: finish writer 1
# a reader here is like R1
R2: start reader 2
W2: start writer 2
W2: finish writer 2
R3: start reader 3
... finish queries ...

which is single writer but reader 1 spans two writers.

My idea has been to create an "IP layer for data" (IP as in Internet 
Protocol) over blocks, rather than making the data structures above that 
layer (B+Trees, Extensible hash tables) versioned - the data structure 
code has to be slightly version aware to pass the version context up and 
down to the block layer but the algorithms are essentially not version 
specific.  This depends on making the block layer sufficiently efficient 
which might become the practical transaction size limitation. 
Compression, or recording operations on blocks according to type (an 
extreme form of application-driven compression), would be very good.  It 
retains the data-publishing focus of TDB - that is, reader performance is 
primary and costs go mainly to the writers.

Only experimentation will tell.  It is my hope to have other data 
structures for indexing in the future so a clean separation of the 
transaction/MVCC issues and the data structure algorithms is going to 
be good.

Given that design, SERIALIZABLE is the natural isolation policy.  There 
is very little locking over blocks in the main database (see below for 
the possibility of none!)

That's not to say that the API design should not have these constants in it, 
even if they are not used.

It happens naturally, each writer sees its modified blocks (as written 
to the WAL) and the blocks of past committed writers + the base 
database.  Readers just don't get changed blocks; reader 1 is working 
against the main database; reader 2 gets modified blocks from the WAL 
because the WAL can't be all written to the main database until reader 1 
gets out the way.

READ_COMMITTED makes sense for limiting growth of the WAL as it means 
that the WAL can be flushed to the main database before an earlier 
reader has finished.  It would have to be done in such a way that all 
the changes happen at once or the data structure invariants could be 
invalid; that is, the reader would also have to be outside such an index 
as well and that needs some locking that would not be there otherwise.

REPEATABLE_READ would need block locking for the updates but seems to 
have a quirk. Unfortunately, just knowing which blocks the reader has 
seen may not be strong enough as a change to the blocks of an index may 
result in a change to a block the reader has looked at and one it has 
not in some data structure consistency way, but the block locking will 
miss that.  Not sure that's safe for the B+trees - it's not in general.

READ_UNCOMMITTED is very likely to break the index structures as the 
higher level assumptions in the B+trees will not be valid.

A feature of the TxTDB MVCC design is that locking is not being done on 
the main database.  Changes only happen when writing back committed 
blocks, and then you know the original block to be over-written is never 
seen by a reader.  Write-back occurs only when all readers up to that writer 
version have exited, and so any outstanding reader will be getting 
the updated block via the WAL copy. No block lock needed (a bit scary 
that bit :-).

Recovery is a matter of redoing the WAL so it is potentially more 
expensive than an undo log where the undo actions are just running 
transactions, not a period of recent history of writers.  But we don't 
crash do we? :-)

I think that limiting the number of versions back that will be admitted 
to some control constant (like 10 or 20, ideally adapting to fewer for 
larger transactions) will be appropriate.  If it's growing, then it's a 
sign the system is being asked to do more than it has the processing 
power to do.

Most of the time, the WAL is write-only, and there are in-memory copies 
of blocks (AKA a cache, with version tiers). This is analogous to the 
direct-mode write cache there is currently, but with an additional 
write-back policy as it needs to sync when a transaction commits.

        Andy

On 21/03/11 22:09, Stephen Allen (JIRA) wrote:
>
> [
> https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009433#comment-13009433
> ]
>
> Stephen Allen commented on JENA-41:
> -----------------------------------
>
> I think your idea about the DatasetGraph being the interface for
> transactions makes sense.  Transactional DatasetGraphs could also
> provide fallback behavior for legacy code by implementing autocommit
> transactions if the user called methods on a dataset that was not
> initialized in a transactionBegin() call.
>
>
> With regard to the isolation levels, I believe some of the lower
> levels can make sense for particular applications or queries.  For
> example, say you want to know the size of a few graphs.
>
> BEGIN READ_ONLY;
> select count(*) where { graph <http://example/g1> { ?s ?p ?o . } } ;
> select count(*) where { graph <http://example/g2> { ?s ?p ?o . } } ;
> COMMIT;
>
> Assuming a traditional pessimistic locking scheme, running the
> transaction at SERIALIZABLE could cause the locks held by the first
> select query to also be held through the second query, reducing
> concurrency (using two transactions instead might not be a good idea
> as there is usually some amount of overhead associated with creating
> and committing transactions).
>
> If you were OK with the possibility that the two query results are
> not truly serializable with respect to each other, then you could
> improve concurrency by using a READ_COMMITTED isolation level instead
> that would give serializable results for each query (but not the
> whole transaction).  And if you really just needed a rough estimate
> of size, using READ_UNCOMMITTED may be able to avoid locking
> altogether.
>
> An additional motivating factor for MVCC implementations is that they
> may be implementing snapshot isolation, which probably maps better to
> READ_COMMITTED than SERIALIZABLE (especially if it could do predicate
> locking for true serializable behavior but allow cheaper snapshot
> isolation if that was all that was needed).  The Postgres
> documentation does a good job of describing this [1].
>
> I would find it useful to have multiple isolation levels available
> (even if internally I'm mapping them all to SERIALIZABLE at first).
> The four ANSI Isolation levels seem appropriate, and remember that
> implementations are allowed to map unavailable lower levels to higher
> levels as desired.
>
>
> [1]
> http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html
