Thanks liang. - Xi
On Sun, Dec 18, 2016 at 5:52 PM, liang xie <xieliang...@gmail.com> wrote: > The developers upload the design doc onto JIRA at least for > HADOOP/HBase/Cassandra/... projects > > On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <si...@apache.org> wrote: > > Hi Xi, > > > > sorry for late response. I will review it soon. > > > > regarding this, a separate question "are we going to use google doc > instead > > of email thread for any discussion"? I am a bit worried that the > discussion > > will become lost after moving to google doc. No idea on how other apache > > projects are doing. > > > > - Sijie > > > > On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> wrote: > > > >> Hi all, > >> > >> I finalized the first version of the design. This time I used a google > doc > >> so that it is easier for commenting and add a link the wiki page. I will > >> update this to the wiki page once we come to the finalized design. > >> > >> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK > >> bSIGgSzXuTI5BA/edit > >> > >> Let me know if you have any questions. Appreciate your reviews! > >> > >> - Xi > >> > >> > >> > >> > >> > >> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart > >> <lstew...@twitter.com.invalid > >> > wrote: > >> > >> > Interesting proposal. A couple quick notes while you continue to flesh > >> this > >> > out. > >> > > >> > a. just to be sure - does this eliminate the need to save seqno with > >> > checkpoint? > >> > > >> > b. i.e. another way to describe this kind of improvement is "support > >> > records (atomic writes) larger than 1MB", iiuc. the advantage being it > >> > avoids the baggage of transactions. disadvantages include inability > to do > >> > cross stream transactions, and flexibility (interleaving, etc) (are > there > >> > others?). > >> > > >> > c. proxy use case is for supporting multiple writers - have you > thought > >> > about how this would work with multiple writers? > >> > > >> > Thanks! > >> > > >> > > >> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sij...@twitter.com.invalid > > > >> > wrote: > >> > > >> > > Sound good to me. look forward to the detailed proposal. > >> > > > >> > > (I don't mind the format if it makes things easier to you) > >> > > > >> > > Sijie > >> > > > >> > > On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> wrote: > >> > > > >> > > > Thank you, Sijie > >> > > > > >> > > > We have some internal discussions to sort out some details. We are > >> > ready > >> > > to > >> > > > collaborate with the community for adding the transaction support > in > >> > DL. > >> > > > We'd like to share more. > >> > > > > >> > > > I created a proposal wiki here - > >> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+ > >> > > > DistributedLog+Transaction+Support > >> > > > > >> > > > (I followed KIP format and named it as DP (DistributedLog > Proposal - > >> DP > >> > > is > >> > > > also short for Dynamic Programming). I don't know if you guys like > >> this > >> > > > name or not. Feel free to change it :D) > >> > > > > >> > > > I basically put my initial email as the content there so far. > Once we > >> > > > finished our final discussion, I will update with more details. At > >> the > >> > > same > >> > > > time, any comments are welcome. > >> > > > > >> > > > - Xi > >> > > > > >> > > > > >> > > > > >> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org > >> > > <javascript:;>> > >> > > > wrote: > >> > > > > >> > > > > Xi, > >> > > > > > >> > > > > I just granted you the edit permission. > >> > > > > > >> > > > > - Sijie > >> > > > > > >> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com > >> > > > <javascript:;>> wrote: > >> > > > > > >> > > > > > I still can not edit the wiki. Can any of the pmc members > grant > >> me > >> > > the > >> > > > > > permissions? > >> > > > > > > >> > > > > > - Xi > >> > > > > > > >> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu < > xi.liu....@gmail.com > >> > > > <javascript:;>> wrote: > >> > > > > > > >> > > > > > > Sijie, > >> > > > > > > > >> > > > > > > I attempted to create a wiki page under that space. I found > >> that > >> > I > >> > > am > >> > > > > not > >> > > > > > > authorized with edit permission. > >> > > > > > > > >> > > > > > > Can any of the committers grant me the wiki edit > permission? My > >> > > > account > >> > > > > > is > >> > > > > > > "xi.liu.ant". > >> > > > > > > > >> > > > > > > - Xi > >> > > > > > > > >> > > > > > > > >> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo < > si...@apache.org > >> > > > <javascript:;>> wrote: > >> > > > > > > > >> > > > > > >> This sounds interesting ... I will take a closer look and > give > >> > my > >> > > > > > comments > >> > > > > > >> later. > >> > > > > > >> > >> > > > > > >> At the same time, do you mind creating a wiki page to put > your > >> > > idea > >> > > > > > there? > >> > > > > > >> You can add your wiki page under > >> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+ > >> > Proposals > >> > > > > > >> > >> > > > > > >> You might need to ask in the dev list to grant the wiki > edit > >> > > > > permissions > >> > > > > > >> to > >> > > > > > >> you once you have a wiki account. > >> > > > > > >> > >> > > > > > >> - Sijie > >> > > > > > >> > >> > > > > > >> > >> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu < > xi.liu....@gmail.com > >> > > > <javascript:;>> wrote: > >> > > > > > >> > >> > > > > > >> > Hello, > >> > > > > > >> > > >> > > > > > >> > I asked the transaction support in distributedlog user > group > >> > two > >> > > > > > months > >> > > > > > >> > ago. I want to raise this up again, as we are looking for > >> > using > >> > > > > > >> > distributedlog for building a transactional data > service. It > >> > is > >> > > a > >> > > > > > major > >> > > > > > >> > feature that is missing in distributedlog. We have some > >> ideas > >> > to > >> > > > add > >> > > > > > >> this > >> > > > > > >> > to distributedlog and want to know if they make sense or > >> not. > >> > If > >> > > > > they > >> > > > > > >> are > >> > > > > > >> > good, we'd like to contribute and develop with the > >> community. > >> > > > > > >> > > >> > > > > > >> > Here are the thoughts: > >> > > > > > >> > > >> > > > > > >> > ------------------------------------------------- > >> > > > > > >> > > >> > > > > > >> > From our understanding, DL can provide "at-least-once" > >> > delivery > >> > > > > > semantic > >> > > > > > >> > (if not, please correct me) but not "exactly-once" > delivery > >> > > > > semantic. > >> > > > > > >> That > >> > > > > > >> > means that a message can be delivered one or more times > if > >> the > >> > > > > reader > >> > > > > > >> > doesn't handle duplicates. > >> > > > > > >> > > >> > > > > > >> > The duplicates come from two places, one is at writer > side > >> > (this > >> > > > > > assumes > >> > > > > > >> > using write proxy not the core library), while the other > one > >> > is > >> > > at > >> > > > > > >> reader > >> > > > > > >> > side. > >> > > > > > >> > > >> > > > > > >> > - writer side: if the client attempts to write a record > to > >> the > >> > > > write > >> > > > > > >> > proxies and gets a network error (e.g timeouts) then > >> retries, > >> > > the > >> > > > > > >> retrying > >> > > > > > >> > will potentially result in duplicates. > >> > > > > > >> > - reader side:if the reader reads a message from a stream > >> and > >> > > then > >> > > > > > >> crashes, > >> > > > > > >> > when the reader restarts it would restart from last known > >> > > position > >> > > > > > >> (DLSN). > >> > > > > > >> > If the reader fails after processing a record and before > >> > > recording > >> > > > > the > >> > > > > > >> > position, the processed record will be delivered again. > >> > > > > > >> > > >> > > > > > >> > The reader problem can be properly addressed by making > use > >> of > >> > > the > >> > > > > > >> sequence > >> > > > > > >> > numbers of records and doing proper checkpointing. For > >> > example, > >> > > in > >> > > > > > >> > database, it can checkpoint the indexed data with the > >> sequence > >> > > > > number > >> > > > > > of > >> > > > > > >> > records; in flink, it can checkpoint the state with the > >> > sequence > >> > > > > > >> numbers. > >> > > > > > >> > > >> > > > > > >> > The writer problem can be addressed by implementing an > >> > > idempotent > >> > > > > > >> writer. > >> > > > > > >> > However, an alternative and more powerful approach is to > >> > support > >> > > > > > >> > transactions. > >> > > > > > >> > > >> > > > > > >> > *What does transaction mean?* > >> > > > > > >> > > >> > > > > > >> > A transaction means a collection of records can be > written > >> > > > > > >> transactionally > >> > > > > > >> > within a stream or across multiple streams. They will be > >> > > consumed > >> > > > by > >> > > > > > the > >> > > > > > >> > reader together when a transaction is committed, or will > >> never > >> > > be > >> > > > > > >> consumed > >> > > > > > >> > by the reader when the transaction is aborted. > >> > > > > > >> > > >> > > > > > >> > The transaction will expose following guarantees: > >> > > > > > >> > > >> > > > > > >> > - The reader should not be exposed to records written > from > >> > > > > uncommitted > >> > > > > > >> > transactions (mandatory) > >> > > > > > >> > - The reader should consume the records in the > transaction > >> > > commit > >> > > > > > order > >> > > > > > >> > rather than the record written order (mandatory) > >> > > > > > >> > - No duplicated records within a transaction (mandatory) > >> > > > > > >> > - Allow interleaving transactional writes and > >> > non-transactional > >> > > > > writes > >> > > > > > >> > (optional) > >> > > > > > >> > > >> > > > > > >> > *Stream Transaction & Namespace Transaction* > >> > > > > > >> > > >> > > > > > >> > There will be two types of transaction, one is Stream > level > >> > > > > > transaction > >> > > > > > >> > (local transaction), while the other one is Namespace > level > >> > > > > > transaction > >> > > > > > >> > (global transaction). > >> > > > > > >> > > >> > > > > > >> > The stream level transaction is a transactional > operation on > >> > > > writing > >> > > > > > >> > records to one stream; the namespace level transaction > is a > >> > > > > > >> transactional > >> > > > > > >> > operation on writing records to multiple streams. > >> > > > > > >> > > >> > > > > > >> > *Implementation Thoughts* > >> > > > > > >> > > >> > > > > > >> > - A transaction is consist of begin control record, a > series > >> > of > >> > > > data > >> > > > > > >> > records and commit/abort control record. > >> > > > > > >> > - The begin/commit/abort control record is written to a > >> > `commit` > >> > > > log > >> > > > > > >> > stream, while the data records will be written to normal > >> data > >> > > log > >> > > > > > >> streams. > >> > > > > > >> > - The `commit` log stream will be the same log stream for > >> > > > > stream-level > >> > > > > > >> > transaction, while it will be a *system* stream (or > >> multiple > >> > > > system > >> > > > > > >> > streams) for namespace-level transactions. > >> > > > > > >> > - The transaction code looks like as below: > >> > > > > > >> > > >> > > > > > >> > <code> > >> > > > > > >> > > >> > > > > > >> > Transaction txn = client.transaction(); > >> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record); > >> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record); > >> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record); > >> > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit(); > >> > > > > > >> > > >> > > > > > >> > </code> > >> > > > > > >> > > >> > > > > > >> > if the txn is committed, all the write futures will be > >> > satisfied > >> > > > > with > >> > > > > > >> their > >> > > > > > >> > written DLSNs. if the txn is aborted, all the write > futures > >> > will > >> > > > be > >> > > > > > >> failed > >> > > > > > >> > together. there is no partial failure state. > >> > > > > > >> > > >> > > > > > >> > - The actually data flow will be: > >> > > > > > >> > > >> > > > > > >> > 1. writer get a transaction id from the owner of the > >> `commit' > >> > > log > >> > > > > > stream > >> > > > > > >> > 1. write the begin control record (synchronously) with > the > >> > > > > transaction > >> > > > > > >> id > >> > > > > > >> > 2. for each write within the same txn, it will be > assigned a > >> > > local > >> > > > > > >> sequence > >> > > > > > >> > number starting from 0. the combination of transaction id > >> and > >> > > > local > >> > > > > > >> > sequence number will be used later on by the readers to > >> > > > de-duplicate > >> > > > > > >> > records. > >> > > > > > >> > 3. the commit/abort control record will be written based > on > >> > the > >> > > > > > results > >> > > > > > >> > from 2. > >> > > > > > >> > > >> > > > > > >> > - Application can supply a timeout for the transaction > when > >> > > > > #begin() a > >> > > > > > >> > transaction. The owner of the `commit` log stream can > abort > >> > > > > > transactions > >> > > > > > >> > that never be committed/aborted within their timeout. > >> > > > > > >> > > >> > > > > > >> > - Failures: > >> > > > > > >> > > >> > > > > > >> > * all the log records can be simply retried as they will > be > >> > > > > > >> de-duplicated > >> > > > > > >> > probably at the reader side. > >> > > > > > >> > > >> > > > > > >> > - Reader: > >> > > > > > >> > > >> > > > > > >> > * Reader can be configured to read uncommitted records or > >> > > > committed > >> > > > > > >> records > >> > > > > > >> > only (by default read uncommitted records) > >> > > > > > >> > * If reader is configured to read committed records only, > >> the > >> > > read > >> > > > > > ahead > >> > > > > > >> > cache will be changed to maintain one additional pending > >> > > committed > >> > > > > > >> records. > >> > > > > > >> > the pending committed records map is bounded and records > >> will > >> > be > >> > > > > > dropped > >> > > > > > >> > when read ahead is moving. > >> > > > > > >> > * when the reader hits a commit record, it will rewind to > >> the > >> > > > begin > >> > > > > > >> record > >> > > > > > >> > and start reading from there. leveraging the proper read > >> ahead > >> > > > cache > >> > > > > > and > >> > > > > > >> > pending commit records cache, it would be good for both > >> short > >> > > > > > >> transactions > >> > > > > > >> > and long transactions. > >> > > > > > >> > > >> > > > > > >> > - DLSN, SequenceId: > >> > > > > > >> > > >> > > > > > >> > * We will add a fourth field to DLSN. It is `local > sequence > >> > > > number` > >> > > > > > >> within > >> > > > > > >> > a transaction session. So the new DLSN of records in a > >> > > transaction > >> > > > > > will > >> > > > > > >> be > >> > > > > > >> > the DLSN of commit control record plus its local sequence > >> > > number. > >> > > > > > >> > * The sequence id will be still the position of the > commit > >> > > record > >> > > > > plus > >> > > > > > >> its > >> > > > > > >> > local sequence number. The position will be advanced with > >> > total > >> > > > > number > >> > > > > > >> of > >> > > > > > >> > written records on writing the commit control record. > >> > > > > > >> > > >> > > > > > >> > - Transaction Group & Namespace Transaction > >> > > > > > >> > > >> > > > > > >> > using one single log stream for namespace transaction can > >> > cause > >> > > > the > >> > > > > > >> > bottleneck problem since all the begin/commit/end control > >> > > records > >> > > > > will > >> > > > > > >> have > >> > > > > > >> > to go through one log stream. > >> > > > > > >> > > >> > > > > > >> > the idea of 'transaction group' is to allow partitioning > the > >> > > > writers > >> > > > > > >> into > >> > > > > > >> > different transaction groups. > >> > > > > > >> > > >> > > > > > >> > clients can specify the `group-name` when starting the > >> > > > transaction. > >> > > > > if > >> > > > > > >> > there is no `group-name` specified, it will use the > default > >> > > > `commit` > >> > > > > > >> log in > >> > > > > > >> > the namespace for creating transactions. > >> > > > > > >> > > >> > > > > > >> > ------------------------------------------------- > >> > > > > > >> > > >> > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate > any > >> > > > comments > >> > > > > > and > >> > > > > > >> if > >> > > > > > >> > anyone is also interested in this idea, we'd like to > >> > collaborate > >> > > > > with > >> > > > > > >> the > >> > > > > > >> > community. > >> > > > > > >> > > >> > > > > > >> > > >> > > > > > >> > - Xi > >> > > > > > >> > > >> > > > > > >> > >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >