Sounds good to me. Looking forward to the detailed proposal. (I don't mind the format if it makes things easier for you.)
Sijie

On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> wrote:

> Thank you, Sijie
>
> We had some internal discussions to sort out some details. We are ready to
> collaborate with the community on adding transaction support to DL.
> We'd like to share more.
>
> I created a proposal wiki here -
> https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support
>
> (I followed the KIP format and named it DP (DistributedLog Proposal - DP is
> also short for Dynamic Programming). I don't know if you guys like this
> name or not. Feel free to change it :D)
>
> I basically put my initial email as the content there so far. Once we
> finish our final discussion, I will update it with more details. In the
> meantime, any comments are welcome.
>
> - Xi
>
> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org> wrote:
>
> > Xi,
> >
> > I just granted you the edit permission.
> >
> > - Sijie
> >
> > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com> wrote:
> >
> > > I still cannot edit the wiki. Can any of the PMC members grant me the
> > > permissions?
> > >
> > > - Xi
> > >
> > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu....@gmail.com> wrote:
> > >
> > > > Sijie,
> > > >
> > > > I attempted to create a wiki page under that space. I found that I am
> > > > not authorized with edit permission.
> > > >
> > > > Can any of the committers grant me the wiki edit permission? My account
> > > > is "xi.liu.ant".
> > > >
> > > > - Xi
> > > >
> > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
> > > >
> > > >> This sounds interesting ... I will take a closer look and give my
> > > >> comments later.
> > > >>
> > > >> At the same time, do you mind creating a wiki page to put your idea there?
> > > >> You can add your wiki page under
> > > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> > > >>
> > > >> You might need to ask on the dev list to be granted wiki edit
> > > >> permissions once you have a wiki account.
> > > >>
> > > >> - Sijie
> > > >>
> > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu....@gmail.com> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I asked about transaction support in the distributedlog user group two
> > > >> > months ago. I want to raise this again, as we are looking at using
> > > >> > distributedlog to build a transactional data service. It is a major
> > > >> > feature that is missing in distributedlog. We have some ideas for adding
> > > >> > this to distributedlog and want to know if they make sense or not. If
> > > >> > they are good, we'd like to contribute and develop with the community.
> > > >> >
> > > >> > Here are the thoughts:
> > > >> >
> > > >> > -------------------------------------------------
> > > >> >
> > > >> > From our understanding, DL can provide "at-least-once" delivery semantics
> > > >> > (if not, please correct me) but not "exactly-once" delivery semantics.
> > > >> > That means that a message can be delivered one or more times if the
> > > >> > reader doesn't handle duplicates.
> > > >> >
> > > >> > The duplicates come from two places: one is on the writer side (this
> > > >> > assumes using the write proxy, not the core library), while the other is
> > > >> > on the reader side.
> > > >> >
> > > >> > - writer side: if the client attempts to write a record to the write
> > > >> > proxies, gets a network error (e.g. a timeout) and then retries, the
> > > >> > retry can result in duplicates.
> > > >> > - reader side: if the reader reads a message from a stream and then
> > > >> > crashes, when the reader restarts it will resume from the last known
> > > >> > position (DLSN). If the reader fails after processing a record and before
> > > >> > recording the position, the processed record will be delivered again.
> > > >> >
> > > >> > The reader problem can be properly addressed by making use of the
> > > >> > sequence numbers of records and doing proper checkpointing. For example,
> > > >> > a database can checkpoint the indexed data with the sequence number of
> > > >> > records; Flink can checkpoint the state with the sequence numbers.
> > > >> >
> > > >> > The writer problem can be addressed by implementing an idempotent writer.
> > > >> > However, an alternative and more powerful approach is to support
> > > >> > transactions.
> > > >> >
> > > >> > *What does transaction mean?*
> > > >> >
> > > >> > A transaction means a collection of records can be written
> > > >> > transactionally within a stream or across multiple streams. They will be
> > > >> > consumed by the reader together when the transaction is committed, or
> > > >> > will never be consumed by the reader when the transaction is aborted.
> > > >> >
> > > >> > The transaction will expose the following guarantees:
> > > >> >
> > > >> > - The reader should not be exposed to records written from uncommitted
> > > >> > transactions (mandatory)
> > > >> > - The reader should consume the records in transaction commit order
> > > >> > rather than record write order (mandatory)
> > > >> > - No duplicated records within a transaction (mandatory)
> > > >> > - Allow interleaving transactional writes and non-transactional writes
> > > >> > (optional)
> > > >> >
> > > >> > *Stream Transaction & Namespace Transaction*
> > > >> >
> > > >> > There will be two types of transaction: one is the stream-level
> > > >> > transaction (local transaction), while the other is the namespace-level
> > > >> > transaction (global transaction).
> > > >> >
> > > >> > The stream-level transaction is a transactional operation on writing
> > > >> > records to one stream; the namespace-level transaction is a
> > > >> > transactional operation on writing records to multiple streams.
> > > >> >
> > > >> > *Implementation Thoughts*
> > > >> >
> > > >> > - A transaction consists of a begin control record, a series of data
> > > >> > records and a commit/abort control record.
> > > >> > - The begin/commit/abort control records are written to a `commit` log
> > > >> > stream, while the data records are written to normal data log streams.
> > > >> > - The `commit` log stream will be the same log stream for stream-level
> > > >> > transactions, while it will be a *system* stream (or multiple system
> > > >> > streams) for namespace-level transactions.
> > > >> > - The transaction code looks like the following:
> > > >> >
> > > >> > <code>
> > > >> >
> > > >> > Transaction txn = client.transaction();
> > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > >> >
> > > >> > </code>
> > > >> >
> > > >> > If the txn is committed, all the write futures will be satisfied with
> > > >> > their written DLSNs. If the txn is aborted, all the write futures will
> > > >> > fail together. There is no partial failure state.
> > > >> >
> > > >> > - The actual data flow will be:
> > > >> >
> > > >> > 1. the writer gets a transaction id from the owner of the `commit` log
> > > >> > stream
> > > >> > 2. the writer writes the begin control record (synchronously) with the
> > > >> > transaction id
> > > >> > 3. each write within the same txn is assigned a local sequence number
> > > >> > starting from 0. The combination of transaction id and local sequence
> > > >> > number will be used later on by the readers to de-duplicate records.
> > > >> > 4. the commit/abort control record will be written based on the results
> > > >> > from step 3.
> > > >> >
> > > >> > - The application can supply a timeout for the transaction when calling
> > > >> > #begin() to start a transaction. The owner of the `commit` log stream can
> > > >> > abort transactions that are not committed/aborted within their timeout.
> > > >> >
> > > >> > - Failures:
> > > >> >
> > > >> > * all the log records can be simply retried as they will be de-duplicated
> > > >> > probably at the reader side.
> > > >> >
> > > >> > - Reader:
> > > >> >
> > > >> > * The reader can be configured to read uncommitted records or committed
> > > >> > records only (by default it reads uncommitted records).
> > > >> > * If the reader is configured to read committed records only, the
> > > >> > read-ahead cache will be changed to maintain one additional map of
> > > >> > pending committed records. The pending committed records map is bounded,
> > > >> > and records will be dropped as read-ahead moves forward.
> > > >> > * When the reader hits a commit record, it will rewind to the begin
> > > >> > record and start reading from there. By leveraging the read-ahead cache
> > > >> > and the pending committed records cache, this should work well for both
> > > >> > short transactions and long transactions.
> > > >> >
> > > >> > - DLSN, SequenceId:
> > > >> >
> > > >> > * We will add a fourth field to DLSN: the `local sequence number` within
> > > >> > a transaction session. So the new DLSN of a record in a transaction will
> > > >> > be the DLSN of the commit control record plus its local sequence number.
> > > >> > * The sequence id will still be the position of the commit record plus
> > > >> > its local sequence number. The position will be advanced by the total
> > > >> > number of written records on writing the commit control record.
> > > >> >
> > > >> > - Transaction Group & Namespace Transaction
> > > >> >
> > > >> > Using one single log stream for namespace transactions can become a
> > > >> > bottleneck, since all the begin/commit/abort control records will have
> > > >> > to go through one log stream.
> > > >> >
> > > >> > The idea of a 'transaction group' is to allow partitioning the writers
> > > >> > into different transaction groups.
> > > >> >
> > > >> > Clients can specify the `group-name` when starting the transaction.
> > > >> > If there is no `group-name` specified, it will use the default `commit`
> > > >> > log in the namespace for creating transactions.
> > > >> >
> > > >> > -------------------------------------------------
> > > >> >
> > > >> > I'd like to collect feedback on this idea. I'd appreciate any comments,
> > > >> > and if anyone is also interested in this idea, we'd like to collaborate
> > > >> > with the community.
> > > >> >
> > > >> > - Xi
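
As a reading aid for the reader-side behavior described in the quoted proposal (read committed records only, deliver in commit order, de-duplicate on transaction id plus local sequence number), here is a minimal Java sketch. It is an illustration under stated assumptions, not DistributedLog code: the class `CommittedReaderSketch`, its methods, and the use of plain strings as payloads are all hypothetical, and it leaves out the bounded cache and the rewind-to-begin-record step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical reader-side buffer; not part of the DistributedLog API. */
public class CommittedReaderSketch {
    // txnId -> (local sequence number -> payload). TreeMap keeps local order.
    private final Map<Long, TreeMap<Integer, String>> pending = new HashMap<>();
    private final List<String> delivered = new ArrayList<>();

    /** Buffer a data record; a retried write with the same (txnId, localSeq) is ignored. */
    public void onData(long txnId, int localSeq, String payload) {
        pending.computeIfAbsent(txnId, id -> new TreeMap<>())
               .putIfAbsent(localSeq, payload);
    }

    /** Commit control record: release the buffered records in local-sequence order. */
    public void onCommit(long txnId) {
        TreeMap<Integer, String> records = pending.remove(txnId);
        if (records != null) { // a retried commit record finds nothing and is a no-op
            delivered.addAll(records.values());
        }
    }

    /** Abort control record: discard everything buffered for the transaction. */
    public void onAbort(long txnId) {
        pending.remove(txnId);
    }

    /** Records handed to the application so far, in commit order. */
    public List<String> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        CommittedReaderSketch reader = new CommittedReaderSketch();
        reader.onData(1L, 0, "a0");
        reader.onData(2L, 0, "b0");
        reader.onData(1L, 0, "a0"); // retried write, same (txnId, localSeq)
        reader.onData(1L, 1, "a1");
        reader.onData(3L, 0, "c0");
        reader.onCommit(2L);        // txn 2 commits before txn 1
        reader.onAbort(3L);         // txn 3 is never delivered
        reader.onCommit(1L);
        reader.onCommit(1L);        // retried commit record
        System.out.println(reader.delivered());
    }
}
```

Running `main` prints `[b0, a0, a1]`: the record from txn 2 comes first because it committed first, the retried `a0` appears once, and nothing from the aborted txn 3 is delivered. A real reader would also need to bound the pending map and handle a duplicate data record that arrives after its transaction has already committed, which this sketch does not attempt.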