Hi Henrik,

Welcome, and thanks for the feedback.

> I hope it's ok to use this list for comments on the whitepaper?

Of course, but we may have to be selective in our back-and-forth. We can always 
take some discussion off-list to keep it manageable.

> if in addition to a deadline you also impose some upper bound for the maximum 
> allowed timestamp

I expect that, much like with LWTs, there will be no facility for user-provided 
timestamps with these transactions. But yes, I anticipate many knock-on 
improvements for tables that are managed with this transaction facility.

> The algorithm is hard to read since you omit the roles of the participants.

Thanks. I will consider how I might make it clearer that the portions of the 
algorithm executed on receipt of messages that only replicas may receive are 
indeed executed by those replicas.

> Is this sentence correct?

Yes, but perhaps it could be made clearer. A previous draft had an additional 
upsilon variable that likely clarified this, but it is hard to use in this 
location (as it would replace tau, which is already bound by the wider 
context), and for consistency I have tried to ensure gamma < tau < upsilon 
throughout the paper.
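Concretely, the pruning rule that sentence describes can be sketched like so 
(illustrative names only; this is neither the paper's pseudocode nor the 
prototype's actual API): once some gamma in deps(tau) is committed, its own 
dependencies are durably transitive dependencies of tau and may be dropped 
from deps(tau), while gamma itself is retained.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the pruning rule: any member of deps(tau) that is
// a dependency of a *committed* gamma in deps(tau) is durably a transitive
// dependency of tau, so it may be removed from deps(tau).
final class DepsPruning {
    static Set<String> prune(Set<String> depsTau,
                             Map<String, Set<String>> directDeps, // txn -> its deps
                             Set<String> committed) {
        Set<String> covered = new HashSet<>();
        for (String gamma : depsTau) {
            if (committed.contains(gamma)) {
                // gamma is committed, so its dependencies are durable and
                // transitively covered through gamma
                covered.addAll(directDeps.getOrDefault(gamma, Set.of()));
            }
        }
        Set<String> pruned = new HashSet<>(depsTau);
        pruned.removeAll(covered); // drop what gamma covers; keep gamma itself
        return pruned;
    }
}
```

So if gamma1 in deps(tau) is committed and gamma2 is among gamma1's 
dependencies, gamma2 can be dropped from deps(tau) while gamma1 remains.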

> Proofs of theorems 3.1 and 3.2 appear to be identical?

Nope. There’s a single but important digit difference.

> Are interactive transactions possible?

No, I don’t think this protocol can easily be made to natively support 
interactive transactions, even discounting the problems you highlight, though I 
haven’t thought about it much as it was not a goal. Interactive transactions 
can certainly be built on top.

> Are the results of the Jepsen testing available too? (Or will be?)

There are no publishable results, nor any intention to publish them. There is a 
(fairly rough) implementation of the Jepsen.io Maelstrom txn-append workload 
that you may run at your leisure in the prototype repository. The in-tree 
strict serializability verifier is, in all honesty, probably more useful today, 
and I think it is functionally equivalent. You are welcome to browse and run 
both. As things progress towards completion, if Kyle is interested or funding 
can be found, I’d love to discuss the possibility of publishing an in-depth 
Jepsen analysis, but that’s a totally separate conversation and, I think, very 
premature.

> So I guess my question is how and when reads happen?

I think this is reasonably well specified in the protocol and, since it’s 
unclear what you’ve found confusing, I don’t know that it would be productive 
to try to explain it again here on the list. You can look at the prototype if 
Java is easier for you to parse; it is of course fully specified there, with no 
ambiguity. Or we can discuss off-list, or perhaps on the community Slack 
channel.


From: Henrik Ingo <henrik.i...@datastax.com>
Date: Monday, 6 September 2021 at 19:08
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi all

I should start by briefly introducing myself: I've worked at Datastax for a
bit over a year, but in a manager role. I have no near-term expectation of
contributing code or docs to Cassandra myself; rather, I hope my work will
indirectly enable others to do so. As such I also don't expect to be very
vocal on this list, but today seemed like a perfect day to make that one
exception! I hope that's ok?

Before joining the Cassandra world I've worked at MongoDB and several
companies in the MySQL ecosystem. If you read the Raft mailing list you
will have met me there. Since my focus was always on high availability and
performance, I've felt very much at home working in the Cassandra ecosystem.



To the authors of the white paper I want to say this is very inspiring
work. I agree it is time to bring general purpose transactions to
Cassandra, and you are introducing them in a way that builds upon
Cassandra's existing Dynamo protocol with natural timestamps. When I was
learning Cassandra 16 months ago I had similar thoughts to what you are now
presenting.

I hope it's ok to use this list for comments on the whitepaper?

1. Introduction

While I agree that cross-shard transactions have only recently become
mainstream, for the academic-level accuracy of your paper you may want to
reference NDB, also known as MySQL NDB Cluster.
 * https://en.wikipedia.org/wiki/MySQL_Cluster
 * http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.884

The above thesis is from 1997, and MySQL acquired the technology for 1 dollar
in 2004. Since shortly after that year it has been in widespread use in
mobile phone networks, with some early e-commerce and OLAP/ML use as
secondary use cases. In short, NDB provides cross-shard transactions simply
via 2PC. A curious detail of the design is that it does replication and
cross-shard coordination both via 2PC: two of the participants just happen to
be replicas of each other.



2.2 Timestamp Reorder buffer

This is probably obvious, and omitted because it's not required by ACCORD,
but I wanted to add that if, in addition to a deadline, you also impose some
upper bound on the maximum allowed timestamp, you will make all our issues
with tombstones from the future go away. (And since you are now creating an
ordered commit log, this will also avoid having to keep tombstones for 10
days, simplify anti-entropy for failed nodes, etc...)

3.2 Consensus

The algorithm is hard to read since you omit the roles of the participants.
It's as if all of it was executed on the Coordinator.

Is this sentence correct? Probably it is and I'm at the limits of my
understanding... *"Note that any transitive dependency of another γ ∈ depsτ
where Committedγ may be pruned from depsτ, as it is durably a transitive
dependency of τ."*



3.4 Safety

Proofs of theorems 3.1 and 3.2 appear to be identical?

End:

Ok so reads were discussed very briefly in 3.3, leaving the reader to guess
quite a lot...

* Are interactive transactions possible? It appears they could be, even if
Algorithm 2 only allows for one pass at reads.
* Do I understand correctly that t0 is essentially both the start and end
time of the transaction? ...and that serializability is provided by the
fact that a later transaction gamma will not even start to execute reads
before earlier transaction tau has committed?
* If interactive transactions are possible, it seems a client can
denial-of-service a row by never committing, keeping locks open forever?

So I guess my question is how and when reads happen?

More precisely: how is it possible that the Consensus protocol is executed
first and already knows its dependencies, even though the Execution protocol
(i.e. the reads and writes) only runs afterwards?

Similarly, how do you expect to apply writes before reads have been returned
to the client? Even if you were proposing some Calvin-like single-shot
transaction, it still raises the question of what mechanism can consume read
results and, based on those, influence the writes?


Reading the CEP:

Are the results of the Jepsen testing available too? (Or will be?)


henrik

On Sun, Sep 5, 2021 at 5:33 PM bened...@apache.org <bened...@apache.org>
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This prototype has
> been verified against Jepsen.io’s Maelstrom library and dedicated in-tree
> strict serializability verification tools, but much work remains before it
> is production-capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>


--

Henrik Ingo

+358 40 569 7354

