In fact, this is why my first attempt was a design with only one transaction
committing at a time, which I thought amounted to SW in terms of
serializability. But I am allowing multiple writers to assemble changes in
multiple transactions at the same time, and I think that is what will prevent
the use of swap-into-commit. Maybe this is a bad trade? Since JENA-624
contemplates very high concurrency, is it worth doing an MR+SW design at all?
But MRMW seems very hard. {grin}
I had some ideas about structuring indexes in such a way as to allow for more
fine-grained locking, and using merge for actual MW, but as you point out,
locking down to particular resources cannot guarantee against conflicts
between conceptual entities. I also had some nightmares trying to think about
how to manage bnodes across multiple writers.
---
A. Soroka
The University of Virginia Library
On Aug 28, 2015, at 6:17 AM, Andy Seaborne <[email protected]> wrote:
> On 27/08/15 16:53, [email protected] wrote:
>> Andy-- Thanks, these comments are really helpful! I've replied
>> in-line in a few places to clarify or answer questions, or ask some
>> of my own. {grin}
>>
>> --- A. Soroka The University of Virginia Library
>>
>
>>> If there are multiple writers, then (1) system aborts will always
>>> be possible (conflicting updates) and (2) locking on datastructures
>>> is necessary ... or timestamps and vector clocks or some such.
>>
>> Right, see below. Again, there are multiple writers, but they only
>> see themselves, and only one committer. "Only one committer at a
>> time" prevents conflicts, since there is no schema to violate, but it
>> is a brutal way to deal with the problem. And the "re-run" scheme of
>> operation means it will be a very real bottleneck.
>>
>>>> 5) Snapshot isolation. Transactions do not see commits that occur
>>>> during their lifetime. Each works entirely from the state of the
>>>> DatasetGraph at the start of its life.
>>> But they see their own updates presumably?
>>
>> Right, that's exactly the purpose of taking off their own reference
>> to the persistent datastructures at the start of the transaction.
>> They "evolve" their datastructures independently.
>
> When used in a program, persistent datastructures diverge when two writes act
> from the same base point.
>
> Transactions do more - they are serializing all operations so there is a
> linear sequence of versions. This is the problem you identify below.
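To make that linear-sequence-of-versions point concrete, here is a minimal,
hypothetical sketch (not Jena or TDB code; an immutable `Set<String>` stands
in for the real index structures): each transaction records the root it
started from, and commit is a compare-and-swap against that root, so an
overlapping commit forces a system abort instead of a silent lost update.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: commit-by-swap over an immutable root, using
// compare-and-swap so a transaction whose base root has been replaced
// by a concurrent commit must system-abort.
public class CasCommit {
    // The committed root; an immutable set stands in for the indexes.
    static final AtomicReference<Set<String>> root =
            new AtomicReference<>(Set.of("T"));

    static class Txn {
        final Set<String> base = root.get();          // snapshot at begin
        final Set<String> work = new HashSet<>(base); // private evolving copy

        void add(String triple) { work.add(triple); }

        // Returns false (system abort) if another transaction committed first.
        boolean commit() {
            return root.compareAndSet(base, Set.copyOf(work));
        }
    }

    public static void main(String[] args) {
        Txn t1 = new Txn();
        Txn t2 = new Txn();              // both start from the same root
        t1.add("T_1");
        t2.add("T_2");
        System.out.println(t1.commit()); // true: root is now {T, T_1}
        System.out.println(t2.commit()); // false: base changed, so abort
    }
}
```

The compare-and-swap is exactly "locking or aborts" with the abort option:
the loser must retry from the new root rather than clobber it.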
>
>>>> 6) Only as many as one transaction per thread, for now.
>>>> Transactions are not thread-safe. These are simplifying
>>>> assumptions that could be relaxed later.
>>>
>>> TDB ended up there as well. There is, internally, a transaction
>>> object but it's held in a ThreadLocal and fetched when needed.
>>> Otherwise a lot of interfaces need a "transaction" parameter and it's
>>> hard to reuse other code that doesn't pass it through.
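A minimal sketch of that ThreadLocal pattern (hypothetical, not the actual
TDB internals; the `Txn` class and its `writer` flag are invented for
illustration): internal code fetches the active transaction from a
ThreadLocal instead of every interface carrying a transaction parameter.

```java
import java.util.Objects;

// Sketch (not TDB code): the per-thread transaction lives in a
// ThreadLocal, so any internal operation can recover it without the
// surrounding interfaces passing a "transaction" parameter through.
public class TxnRegistry {
    public static final class Txn {
        final boolean writer;
        Txn(boolean writer) { this.writer = writer; }
    }

    private static final ThreadLocal<Txn> current = new ThreadLocal<>();

    public static void begin(boolean writer) {
        if (current.get() != null)
            throw new IllegalStateException("one transaction per thread");
        current.set(new Txn(writer));
    }

    // Called from anywhere inside the dataset implementation.
    public static Txn active() {
        return Objects.requireNonNull(current.get(), "not in a transaction");
    }

    public static void end() { current.remove(); }
}
```

This also bakes in the "at most one transaction per thread" simplifying
assumption from point 6 above: begin() refuses a second transaction on the
same thread.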
>>
>> That's close to what I sketched out.
>>
>>> I have taken a second take on transactions with TDB2. This module
>>> is an independent transactions system, unlike TDB1 where it's
>>> TDB1-specific.
>>> https://github.com/afs/mantis/tree/master/dboe-transaction It needs
>>> documentation for use on its own but I have used it in another
>>> project to coordinate distributed transactions. (dboe = database
>>> operating environment)
>>
>> I need to study this more. Obviously, if I can take over some of your
>> work, that would be ideal.
>>
>>>> My current design operates as follows: <snipped>
>>> Looks good. I don't quite understand the need to record and rerun
>>> though - isn't the power of pcollections that there can be old and
>>> new roots to the datastructures and commit is swap to new one,
>>> abort is forget the new one.
>>
>> Yeah, but my worry (perhaps just my misunderstanding) is over
>> transactions interacting badly in the presence of snapshot isolation.
>> Let's say we did use the technique of atomic swap, and consider the
>> following scenario:
>>
>>
>> T=-1 The committed datastructures contain triples T.
>> T=0 Transaction 1 begins, taking a reference to the datastructures
>> T=1 Transaction 2 begins, taking its own reference to the datastructures
>> T=3 Transaction 1 does some updates, adding some triples T_1 to its own
>> "branch", resulting in T+T_1.
>> T=4 Transaction 2 does some updates, adding some triples T_2 to its own
>> "branch", resulting in T+T_2.
>> T=5 Transaction 1 commits, so that the committed triples are now T + T_1.
>> T=6 Transaction 2 commits, so that the committed triples are now T + T_2.
>
>>
>> We lost Transaction 1's T_1 triples. I think this technique actually
>> requires _merge_ instead of swap, either merge-into-open-transactions
>> (after a commit) which isn't snapshot isolation or merge-into-commit
>> (instead of swap-into-commit). But there's plenty of chance that I'm
>> just misunderstanding this whole thing. {grin} I have not designed a
>> transaction system over persistent datastructures before, and I
>> welcome correction. I also need to research more about persistent
>> datastructures with merge capability.
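The timeline above, as a runnable sketch (hypothetical, with an immutable
`Set<String>` standing in for the persistent datastructures): unconditional
swap-into-commit lets the second committer silently discard the first
committer's triples.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the timeline above: with plain swap-into-commit, the
// second committer overwrites the root and Transaction 1's triples
// (T_1) are lost.
public class LostUpdate {
    static final AtomicReference<Set<String>> committed =
            new AtomicReference<>(Set.of("T"));     // T=-1: triples T

    static Set<String> snapshot() { return committed.get(); }

    static void commitBySwap(Set<String> branch) { // unconditional swap
        committed.set(Set.copyOf(branch));
    }

    public static void main(String[] args) {
        Set<String> txn1 = new HashSet<>(snapshot()); // T=0: own reference
        Set<String> txn2 = new HashSet<>(snapshot()); // T=1: own reference
        txn1.add("T_1");                              // T=3: branch T+T_1
        txn2.add("T_2");                              // T=4: branch T+T_2
        commitBySwap(txn1);                           // T=5: root is {T, T_1}
        commitBySwap(txn2);                           // T=6: root is {T, T_2}
        System.out.println(committed.get().contains("T_1")); // false: lost
    }
}
```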
>
> which is why 2+ writers need locking or aborts.
>
>
> The common ACID example:
>
> Start with:
>
> :account :balance 10 .
>
> W1 (adds 5 to the account)
>
> Delete
> :account :balance 10 .
> Insert
> :account :balance 15 .
>
> W2 (adds 7 to the account)
>
> Delete
> :account :balance 10 .
> Insert
> :account :balance 17 .
>
>
> Oh dear. No amount of merge or swap will work.
> Either W2 (or W1) is aborted or you get inconsistency.
>
> If you really, really want true parallel writers, then you'll need more than
> rerunning with a fixed resolution algorithm. It is hard enough in RDF even
> to detect that there is a conflict.
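Andy's account example can be run through a naive set-level merge to show why
no structural merge helps (a hypothetical sketch; triples are plain strings
here): applying W1's and W2's delete/insert pairs to the same base and
unioning the results leaves two contradictory balance triples, and the
intended answer (22) appears nowhere, because the conflict is semantic, not
structural.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the account example: set-merging the outcomes of two
// writers who both started from balance 10 yields two contradictory
// :balance triples -- no structural merge can produce the intended 22.
public class MergeConflict {
    static Set<String> apply(Set<String> base, String del, String ins) {
        Set<String> out = new HashSet<>(base);
        out.remove(del);  // the writer's Delete
        out.add(ins);     // the writer's Insert
        return out;
    }

    static Set<String> mergedResult() {
        Set<String> base = Set.of(":account :balance 10 .");
        Set<String> w1 = apply(base, ":account :balance 10 .",
                                     ":account :balance 15 ."); // adds 5
        Set<String> w2 = apply(base, ":account :balance 10 .",
                                     ":account :balance 17 ."); // adds 7
        Set<String> merged = new HashSet<>(w1);
        merged.addAll(w2);  // naive set union of the two outcomes
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergedResult()); // both 15 and 17; never 22
    }
}
```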
>
> An application transaction deciding itself to abort is rare, so most
> overlapping writers will both commit - all the work of one is always going to
> be lost, presumably with a retry, and that means the app writer getting
> involved.
>
> In an SQL database, a row lock on the account resolves the problem. But in
> RDF there is nothing in the data that plays the role of the SQL row.
>
> :account :balance [ :currency "USD" ; :value 10 ] .
>
> so locking ":account" does not work. A conceptual entity isn't tied to a
> single graph node.
>
> Single true-writer does not suffer from this. Or dirty reads. Or phantom
> reads. (these require thread locking and don't work with persistent
> datastructures).
>
> But multiple-true writers aren't a common use case - multiple readers are.
> MR+SW lets readers proceed at any time without blocking; writers never system
> abort.
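A minimal MR+SW sketch of that claim (hypothetical, with an immutable set
standing in for the indexes): readers take a snapshot of the root with no
locking at all, while a single writer lock serializes writers, so a writer's
commit-by-swap can never collide with another commit and never system-aborts.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical MR+SW sketch: readers get a wait-free snapshot of the
// current root; the writer lock guarantees one true writer at a time,
// so commit is a plain, conflict-free swap.
public class MrSw {
    static final AtomicReference<Set<String>> root =
            new AtomicReference<>(Set.of("T"));
    static final ReentrantLock writeLock = new ReentrantLock();

    // Readers: no lock, and isolated from any later commits.
    static Set<String> beginRead() { return root.get(); }

    // Writers: serialized, so the swap never races another commit.
    static void write(String triple) {
        writeLock.lock();
        try {
            Set<String> work = new HashSet<>(root.get());
            work.add(triple);
            root.set(Set.copyOf(work)); // commit by swap
        } finally {
            writeLock.unlock();
        }
    }

    public static void main(String[] args) {
        Set<String> reader = beginRead();                // reader snapshot
        write("T_1");                                    // concurrent commit
        System.out.println(reader.contains("T_1"));      // false: snapshot
        System.out.println(beginRead().contains("T_1")); // true: new root
    }
}
```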
>
> Andy