In fact, this is why my first attempt was a design with only one transaction
committing at a time, which I thought amounted to SW in terms of
serializability. But I am allowing multiple writers to assemble changes in
multiple transactions at the same time, and I think that is what will prevent
the use of swap-into-commit. Maybe this is a bad trade? Since JENA-624
contemplates very high concurrency, is it worth doing an MR+SW design at all?
But MRMW seems very hard. {grin}
I had some ideas about structuring indexes in such a way as to allow for more
fine-grained locking, and using merge for actual MW, but as you point out,
locking down to particular resources cannot guarantee against conflicts
between conceptual entities. I also had some nightmares trying to think about
how to manage bnodes across multiple writers.
---
A. Soroka
The University of Virginia Library
On Aug 28, 2015, at 6:17 AM, Andy Seaborne <[email protected]> wrote:
> On 27/08/15 16:53, [email protected] wrote:
>> Andy-- Thanks, these comments are really helpful! I've replied
>> in-line in a few places to clarify or answer questions, or ask some
>> of my own. {grin}
>>
>> --- A. Soroka The University of Virginia Library
>>
>
>>> If there are multiple writers, then (1) system aborts will always
>>> be possible (conflicting updates) and (2) locking on datastructures
>>> is necessary ... or timestamps and vector clocks or some such.
>>
>> Right, see below. Again, there are multiple writers, but they only
>> see themselves, and only one committer. "Only one committer at a
>> time" prevents conflicts, since there is no schema to violate, but it
>> is a brutal way to deal with the problem. And the "re-run" scheme of
>> operation means it will be a very real bottleneck.
>>
>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>> during their lifetime. Each works entirely from the state of the
>>>> DatasetGraph at the start of its life.
>>> But they see their own updates presumably?
>>
>> Right, that's exactly the purpose of taking off their own reference
>> to the persistent datastructures at the start of the transaction.
>> They "evolve" their datastructures independently.
>
> When used in a program, persistent datastructures diverge when two writes act
> from the same base point.
>
> Transactions do more - they are serializing all operations so there is a
> linear sequence of versions. This is the problem you identify below.
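To make that linear-sequence-of-versions point concrete, here is a minimal,
hypothetical sketch (not Jena or TDB code; an immutable `Set<String>` stands
in for the real index structures): each transaction records the root it
started from, and commit is a compare-and-swap against that root, so an
overlapping commit forces a system abort instead of a silent lost update.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: commit-by-swap over an immutable root, using
// compare-and-swap so a transaction whose base root has been replaced
// by a concurrent commit must system-abort.
public class CasCommit {
    // The committed root; an immutable set stands in for the indexes.
    static final AtomicReference<Set<String>> root =
            new AtomicReference<>(Set.of("T"));

    static class Txn {
        final Set<String> base = root.get();          // snapshot at begin
        final Set<String> work = new HashSet<>(base); // private evolving copy

        void add(String triple) { work.add(triple); }

        // Returns false (system abort) if another transaction committed first.
        boolean commit() {
            return root.compareAndSet(base, Set.copyOf(work));
        }
    }

    public static void main(String[] args) {
        Txn t1 = new Txn();
        Txn t2 = new Txn();              // both start from the same root
        t1.add("T_1");
        t2.add("T_2");
        System.out.println(t1.commit()); // true: root is now {T, T_1}
        System.out.println(t2.commit()); // false: base changed, so abort
    }
}
```

The compare-and-swap is exactly "locking or aborts" with the abort option:
the loser must retry from the new root rather than clobber it.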
>
>>>> 6) Only as many as one transaction per thread, for now.
>>>> Transactions are not thread-safe. These are simplifying
>>>> assumptions that could be relaxed later.
>>>
>>> TDB ended up there as well. There is, internally, a transaction
>>> object but it's held in a ThreadLocal and fetched when needed.
>>> Otherwise a lot of interfaces need a "transaction" parameter and it's
>>> hard to reuse other code that doesn't pass it through.
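A minimal sketch of that ThreadLocal pattern (hypothetical, not the actual
TDB internals; the `Txn` class and its `writer` flag are invented for
illustration): internal code fetches the active transaction from a
ThreadLocal instead of every interface carrying a transaction parameter.

```java
import java.util.Objects;

// Sketch (not TDB code): the per-thread transaction lives in a
// ThreadLocal, so any internal operation can recover it without the
// surrounding interfaces passing a "transaction" parameter through.
public class TxnRegistry {
    public static final class Txn {
        final boolean writer;
        Txn(boolean writer) { this.writer = writer; }
    }

    private static final ThreadLocal<Txn> current = new ThreadLocal<>();

    public static void begin(boolean writer) {
        if (current.get() != null)
            throw new IllegalStateException("one transaction per thread");
        current.set(new Txn(writer));
    }

    // Called from anywhere inside the dataset implementation.
    public static Txn active() {
        return Objects.requireNonNull(current.get(), "not in a transaction");
    }

    public static void end() { current.remove(); }
}
```

This also bakes in the "at most one transaction per thread" simplifying
assumption from point 6 above: begin() refuses a second transaction on the
same thread.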
>>
>> That's close to what I sketched out.
>>
>>> I have taken a second take on transactions with TDB2. This module
>>> is an independent transactions system, unlike TDB1 where it's
>>> TDB1-specific.
>>> https://github.com/afs/mantis/tree/master/dboe-transaction It needs
>>> documentation for use on its own but I have used it in another
>>> project to coordinate distributed transactions. (dboe = database
>>> operating environment)
>>
>> I need to study this more. Obviously, if I can take over some of your
>> work, that would be ideal.
>>
>>>> My current design operates as follows: <snipped>
>>> Looks good. I don't quite understand the need to record and rerun
>>> though - isn't the power of pcollections that there can be old and
>>> new roots to the datastructures and commit is swap to new one,
>>> abort is forget the new one.
>>
>> Yeah, but my worry (perhaps just my misunderstanding) is over
>> transactions interacting badly in the presence of snapshot isolation.
>> Let's say we did use the technique of atomic swap, and consider the
>> following scenario:
>>
>>
>> T=-1 The committed datastructures contain triples T.
>> T=0 Transaction 1 begins, taking a reference to the datastructures
>> T=1 Transaction 2 begins, taking its own reference to the datastructures
>> T=3 Transaction 1 does some updates, adding some triples T_1 to its own
>> "branch", resulting in T+T_1.
>> T=4 Transaction 2 does some updates, adding some triples T_2 to its own
>> "branch", resulting in T+T_2.
>> T=5 Transaction 1 commits, so that the committed triples are now T + T_1.
>> T=6 Transaction 2 commits, so that the committed triples are now T + T_2.
>
>>
>> We lost Transaction 1's T_1 triples. I think this technique actually
>> requires _merge_ instead of swap, either merge-into-open-transactions
>> (after a commit) which isn't snapshot isolation or merge-into-commit
>> (instead of swap-into-commit). But there's plenty of chance that I'm
>> just misunderstanding this whole thing. {grin} I have not designed a
>> transaction system over persistent datastructures before, and I
>> welcome correction. I also need to research more about persistent
>> datastructures with merge capability.
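The timeline above, as a runnable sketch (hypothetical, with an immutable
`Set<String>` standing in for the persistent datastructures): unconditional
swap-into-commit lets the second committer silently discard the first
committer's triples.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the timeline above: with plain swap-into-commit, the
// second committer overwrites the root and Transaction 1's triples
// (T_1) are lost.
public class LostUpdate {
    static final AtomicReference<Set<String>> committed =
            new AtomicReference<>(Set.of("T"));     // T=-1: triples T

    static Set<String> snapshot() { return committed.get(); }

    static void commitBySwap(Set<String> branch) { // unconditional swap
        committed.set(Set.copyOf(branch));
    }

    public static void main(String[] args) {
        Set<String> txn1 = new HashSet<>(snapshot()); // T=0: own reference
        Set<String> txn2 = new HashSet<>(snapshot()); // T=1: own reference
        txn1.add("T_1");                              // T=3: branch T+T_1
        txn2.add("T_2");                              // T=4: branch T+T_2
        commitBySwap(txn1);                           // T=5: root is {T, T_1}
        commitBySwap(txn2);                           // T=6: root is {T, T_2}
        System.out.println(committed.get().contains("T_1")); // false: lost
    }
}
```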
>
> which is why 2+ writers need locking or aborts.
>
>
> The common ACID example:
>
> Start with:
>
> :account :balance 10 .
>
> W1 (adds 5 to the account)
>
> Delete
> :account :balance 10 .
> Insert
> :account :balance 15 .
>
> W2 (adds 7 to the account)
>
> Delete
> :account :balance 10 .
> Insert
> :account :balance 17 .
>
>
> Oh dear. No amount of merge or swap will work.
> Either W2 (or W1) is aborted or you get inconsistency.
>
> If you really, really want true parallel writers, then you'll need more than
> rerunning with a fixed resolution algorithm. It is hard enough in RDF even
> to detect that there is a conflict.
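Andy's account example can be run through a naive set-level merge to show why
no structural merge helps (a hypothetical sketch; triples are plain strings
here): applying W1's and W2's delete/insert pairs to the same base and
unioning the results leaves two contradictory balance triples, and the
intended answer (22) appears nowhere, because the conflict is semantic, not
structural.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the account example: set-merging the outcomes of two
// writers who both started from balance 10 yields two contradictory
// :balance triples -- no structural merge can produce the intended 22.
public class MergeConflict {
    static Set<String> apply(Set<String> base, String del, String ins) {
        Set<String> out = new HashSet<>(base);
        out.remove(del);  // the writer's Delete
        out.add(ins);     // the writer's Insert
        return out;
    }

    static Set<String> mergedResult() {
        Set<String> base = Set.of(":account :balance 10 .");
        Set<String> w1 = apply(base, ":account :balance 10 .",
                                     ":account :balance 15 ."); // adds 5
        Set<String> w2 = apply(base, ":account :balance 10 .",
                                     ":account :balance 17 ."); // adds 7
        Set<String> merged = new HashSet<>(w1);
        merged.addAll(w2);  // naive set union of the two outcomes
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergedResult()); // both 15 and 17; never 22
    }
}
```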
>
> An application transaction deciding itself to abort is rare, so most
> overlapping writers will both commit - all the work of one is always going to
> be lost, presumably with a retry, and that means the app writer getting
> involved.
>
> In an SQL database, a row lock on the account resolves the problem. But in
> RDF there is nothing in the data that plays the role of the SQL row.
>
> :account :balance [ :currency "USD" ; :value 10 ] .
>
> so locking ":account" does not work. A conceptual entity isn't tied to a
> single graph node.
>
> Single true-writer does not suffer from this. Or dirty reads. Or phantom
> reads. (these require thread locking and don't work with persistent
> datastructures).
>
> But multiple-true writers aren't a common use case - multiple readers are.
> MR+SW lets readers proceed at any time without blocking; writers never system
> abort.
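A minimal MR+SW sketch of that claim (hypothetical, with an immutable set
standing in for the indexes): readers take a snapshot of the root with no
locking at all, while a single writer lock serializes writers, so a writer's
commit-by-swap can never collide with another commit and never system-aborts.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical MR+SW sketch: readers get a wait-free snapshot of the
// current root; the writer lock guarantees one true writer at a time,
// so commit is a plain, conflict-free swap.
public class MrSw {
    static final AtomicReference<Set<String>> root =
            new AtomicReference<>(Set.of("T"));
    static final ReentrantLock writeLock = new ReentrantLock();

    // Readers: no lock, and isolated from any later commits.
    static Set<String> beginRead() { return root.get(); }

    // Writers: serialized, so the swap never races another commit.
    static void write(String triple) {
        writeLock.lock();
        try {
            Set<String> work = new HashSet<>(root.get());
            work.add(triple);
            root.set(Set.copyOf(work)); // commit by swap
        } finally {
            writeLock.unlock();
        }
    }

    public static void main(String[] args) {
        Set<String> reader = beginRead();                // reader snapshot
        write("T_1");                                    // concurrent commit
        System.out.println(reader.contains("T_1"));      // false: snapshot
        System.out.println(beginRead().contains("T_1")); // true: new root
    }
}
```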
>
> Andy