On 27/08/15 16:53, [email protected] wrote:
> Andy -- Thanks, these comments are really helpful! I've replied in-line
> in a few places to clarify or answer questions, or ask some of my own.
> {grin}
>
> ---
> A. Soroka
> The University of Virginia Library
>> If there are multiple writers, then (1) system aborts will always be
>> possible (conflicting updates) and (2) locking on datastructures is
>> necessary ... or timestamps and vector clocks or some such.
>
> Right, see below. Again, there are multiple writers, but they only see
> themselves, and only one committer.

"Only one committer at a time" prevents conflicts, since there is no
schema to violate, but it is a brutal way to deal with the problem. And
the "re-run" scheme of operation means it will be a very real bottleneck.

>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>> during their lifetime. Each works entirely from the state of the
>>> DatasetGraph at the start of its life.
>>
>> But they see their own updates presumably?
>
> Right, that's exactly the purpose of taking off their own reference to
> the persistent datastructures at the start of the transaction. They
> "evolve" their datastructures independently.
When used in a program, persistent datastructures diverge when two
writers act from the same base point. Transactions do more: they
serialize all operations so that there is a linear sequence of versions.
This is the problem you identify below.
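The divergence is easy to see in a few lines of Java. This is only a
sketch: "persistent" is simulated here with copy-on-write over
java.util.Set (a real implementation would use a structurally-sharing
library such as pcollections), and all names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

public class Divergence {
    // Adding returns a new set; the base set is never mutated.
    static Set<String> add(Set<String> base, String triple) {
        Set<String> next = new HashSet<>(base);
        next.add(triple);
        return next;
    }

    public static void main(String[] args) {
        Set<String> committed = add(new HashSet<>(), ":s :p :o");

        // Two writers branch from the same base point...
        Set<String> writer1 = add(committed, ":s :p :o1");
        Set<String> writer2 = add(committed, ":s :p :o2");

        // ...and diverge: neither sees the other's update, and the base
        // is unchanged. Something must serialize them at commit time.
        System.out.println(writer1.contains(":s :p :o2")); // false
        System.out.println(writer2.contains(":s :p :o1")); // false
        System.out.println(committed.size());              // 1
    }
}
```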
>>> 6) Only as many as one transaction per thread, for now. Transactions
>>> are not thread-safe. These are simplifying assumptions that could be
>>> relaxed later.
>>
>> TDB ended up there as well. There is, internally, a transaction object,
>> but it's held in a ThreadLocal and fetched when needed. Otherwise a lot
>> of interfaces need a "transaction" parameter, and it's hard to reuse
>> other code that doesn't pass it through.
>
> That's close to what I sketched out.
>
>> I have taken a second take on transactions with TDB2. This module is an
>> independent transaction system, unlike TDB1, where it's TDB1-specific:
>>
>>   https://github.com/afs/mantis/tree/master/dboe-transaction
>>
>> It needs documentation for use on its own, but I have used it in
>> another project to coordinate distributed transactions.
>> (dboe = database operating environment)
>
> I need to study this more. Obviously, if I can take over some of your
> work, that would be ideal.
>
>>> My current design operates as follows: <snipped>
>>
>> Looks good. I don't quite understand the need to record and rerun,
>> though - isn't the power of pcollections that there can be old and new
>> roots to the datastructures, and commit is a swap to the new one, abort
>> is forgetting the new one?
>
> Yeah, but my worry (perhaps just my misunderstanding) is over
> transactions interacting badly in the presence of snapshot isolation.
> Let's say we did use the technique of atomic swap, and consider the
> following scenario:
>
> T=-1  The committed datastructures contain triples T.
> T=0   Transaction 1 begins, taking a reference to the datastructures.
> T=1   Transaction 2 begins, taking its own reference to the
>       datastructures.
> T=3   Transaction 1 does some updates, adding some triples T_1 to its
>       own "branch", resulting in T + T_1.
> T=4   Transaction 2 does some updates, adding some triples T_2 to its
>       own "branch", resulting in T + T_2.
> T=5   Transaction 1 commits, so that the committed triples are now
>       T + T_1.
> T=6   Transaction 2 commits, so that the committed triples are now
>       T + T_2.
>
> We lost Transaction 1's T_1 triples. I think this technique actually
> requires _merge_ instead of swap: either merge-into-open-transactions
> (after a commit), which isn't snapshot isolation, or merge-into-commit
> (instead of swap-into-commit). But there's plenty of chance that I'm
> just misunderstanding this whole thing. {grin} I have not designed a
> transaction system over persistent datastructures before, and I welcome
> correction. I also need to research more about persistent datastructures
> with merge capability.
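That scenario can be replayed mechanically. A minimal Java sketch (the
AtomicReference root and all names are illustrative assumptions, not the
actual design): a blind swap reproduces the lost update, while a
compare-and-swap from each transaction's base reference at least detects
the conflict, which is where abort/rerun (or merge) comes in.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class SwapLostUpdate {
    // Adding returns a new set; the base is never mutated.
    static Set<String> with(Set<String> base, String triple) {
        Set<String> next = new HashSet<>(base);
        next.add(triple);
        return next;
    }

    public static void main(String[] args) {
        Set<String> initial = Set.of("T");
        AtomicReference<Set<String>> committed = new AtomicReference<>(initial);

        // T=0, T=1: both transactions take a reference at begin().
        Set<String> base1 = committed.get();
        Set<String> base2 = committed.get();

        // T=3, T=4: each evolves its own branch.
        Set<String> txn1 = with(base1, "T_1");
        Set<String> txn2 = with(base2, "T_2");

        // T=5, T=6 with blind swap: transaction 2 silently discards T_1.
        committed.set(txn1);
        committed.set(txn2);
        System.out.println(committed.get().contains("T_1")); // false - lost

        // Replaying with compare-and-swap from each transaction's base:
        // the second commit fails instead of losing data silently.
        committed.set(initial);
        boolean ok1 = committed.compareAndSet(base1, txn1);
        boolean ok2 = committed.compareAndSet(base2, txn2);
        System.out.println(ok1 + " " + ok2); // true false - conflict seen
    }
}
```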
Which is why 2+ writers need locking or aborts. The common ACID example:
start with

    :account :balance 10 .

W1 (adds 5 to the account):

    Delete :account :balance 10 .
    Insert :account :balance 15 .

W2 (adds 7 to the account):

    Delete :account :balance 10 .
    Insert :account :balance 17 .

Oh dear. No amount of merge or swap will work. Either W2 (or W1) is
aborted, or you get inconsistency.

If you really, really want true parallel writers, then you'll need more
than a rerun with a fixed resolution algorithm. It is hard enough in RDF
to even detect that there is a conflict.
An application transaction deciding for itself to abort is rare, so most
overlapping writers will both try to commit. All the work of one is
always going to be lost, presumably with a retry, and that means the
application writer getting involved.
In an SQL database, a row lock on the account resolves the problem. But
there isn't anything in the RDF data that plays the same role as the row
does in SQL:
    :account :balance [ :currency "USD" ; :value 10 ] .

so locking ":account" does not work: a conceptual entity isn't tied to a
single graph node.
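For what it's worth, the balance example can be run through a set-level
merge to see the inconsistency directly. A sketch with triples as plain
strings (the apply() helper is illustrative, not from any real API):

```java
import java.util.HashSet;
import java.util.Set;

public class MergeInconsistency {
    // Apply a delete/insert delta to a graph, returning a new graph.
    static Set<String> apply(Set<String> graph,
                             Set<String> deletes, Set<String> inserts) {
        Set<String> next = new HashSet<>(graph);
        next.removeAll(deletes);
        next.addAll(inserts);
        return next;
    }

    public static void main(String[] args) {
        Set<String> base = Set.of(":account :balance 10 .");

        // Merge both writers' deltas, each recorded against the same base.
        Set<String> merged = apply(base,
                Set.of(":account :balance 10 ."),   // W1 delete
                Set.of(":account :balance 15 ."));  // W1 insert
        merged = apply(merged,
                Set.of(":account :balance 10 ."),   // W2 delete (already gone)
                Set.of(":account :balance 17 ."));  // W2 insert

        // Both balances survive: the merged graph is inconsistent, and
        // nothing at the triple level flags these two as conflicting.
        System.out.println(merged.contains(":account :balance 15 .")); // true
        System.out.println(merged.contains(":account :balance 17 .")); // true
    }
}
```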
A single true writer does not suffer from this. Nor from dirty reads,
nor from phantom reads. (Preventing those normally requires thread
locking, and they don't arise with persistent datastructures.)
But multiple true writers aren't a common use case - multiple readers
are. MR+SW (multiple readers + single writer) lets readers proceed at
any time without blocking, and writers never system-abort.
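A minimal MR+SW sketch over a persistent root (illustrative names, not
TDB or dboe-transaction code): readers just take the current root
reference, so they never block; writers queue on a single lock, so a
committing writer can never be surprised and never system-aborts.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

public class MrSw {
    private final AtomicReference<Set<String>> root =
            new AtomicReference<>(Set.of());
    private final ReentrantLock writeLock = new ReentrantLock();

    // Readers: a snapshot is just the current root reference.
    Set<String> beginRead() {
        return root.get();
    }

    // Writers: serialized by the lock, so commit is a simple publish.
    void write(String triple) {
        writeLock.lock();
        try {
            Set<String> next = new HashSet<>(root.get());
            next.add(triple);
            root.set(next);          // commit = swap in the new root
        } finally {
            writeLock.unlock();      // abort would simply skip the set()
        }
    }

    public static void main(String[] args) {
        MrSw db = new MrSw();
        Set<String> snapshot = db.beginRead();
        db.write(":s :p :o");
        System.out.println(snapshot.size());        // 0 - snapshot unchanged
        System.out.println(db.beginRead().size());  // 1
    }
}
```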
Andy
