Thank you, Sijie

We have been having some internal discussions to sort out some details. We
are ready to collaborate with the community on adding transaction support
to DL, and we'd like to share more.

I created a proposal wiki here -
https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support

(I followed the KIP format and named it DP (DistributedLog Proposal - DP is
also short for Dynamic Programming). I don't know if you guys like this
name or not. Feel free to change it :D)

So far I have basically put my initial email there as the content. Once we
finish our final discussion, I will update it with more details. In the
meantime, any comments are welcome.

- Xi



On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org> wrote:

> Xi,
>
> I just granted you the edit permission.
>
> - Sijie
>
> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>
> > I still cannot edit the wiki. Can any of the PMC members grant me the
> > permissions?
> >
> > - Xi
> >
> > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu....@gmail.com> wrote:
> >
> > > Sijie,
> > >
> > > I attempted to create a wiki page under that space, but I found that I
> > > do not have edit permission.
> > >
> > > Can any of the committers grant me the wiki edit permission? My account
> > > is "xi.liu.ant".
> > >
> > > - Xi
> > >
> > >
> > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
> > >
> > >> This sounds interesting ... I will take a closer look and give my
> > >> comments later.
> > >>
> > >> At the same time, do you mind creating a wiki page to put your idea
> > >> there? You can add your wiki page under
> > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> > >>
> > >> You might need to ask in the dev list to grant the wiki edit
> > >> permissions to you once you have a wiki account.
> > >>
> > >> - Sijie
> > >>
> > >>
> > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu....@gmail.com> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I asked about transaction support in the distributedlog user group
> > >> > two months ago. I want to raise this again, as we are looking at
> > >> > using distributedlog to build a transactional data service.
> > >> > Transaction support is a major feature that is missing in
> > >> > distributedlog. We have some ideas for adding it and want to know
> > >> > whether they make sense. If they do, we'd like to contribute and
> > >> > develop them with the community.
> > >> >
> > >> > Here are the thoughts:
> > >> >
> > >> > -------------------------------------------------
> > >> >
> > >> > From our understanding, DL can provide "at-least-once" delivery
> > >> > semantics (if not, please correct me) but not "exactly-once" delivery
> > >> > semantics. That means that a message can be delivered more than once,
> > >> > so the reader sees duplicates unless it handles them itself.
> > >> >
> > >> > The duplicates come from two places: one is the writer side (this
> > >> > assumes using the write proxy, not the core library), and the other
> > >> > is the reader side.
> > >> >
> > >> > - writer side: if the client attempts to write a record to the write
> > >> > proxies, gets a network error (e.g. a timeout) and then retries, the
> > >> > retry can result in duplicates.
> > >> > - reader side: if the reader reads a message from a stream and then
> > >> > crashes, it restarts from the last known position (DLSN). If the
> > >> > reader fails after processing a record but before recording the
> > >> > position, the processed record will be delivered again.
> > >> >
> > >> > The reader problem can be addressed by making use of the sequence
> > >> > numbers of records and doing proper checkpointing. For example, a
> > >> > database can checkpoint the indexed data together with the sequence
> > >> > numbers of the records; Flink can checkpoint its state together with
> > >> > the sequence numbers.
> > >> >
> > >> > The writer problem can be addressed by implementing an idempotent
> > >> > writer. However, an alternative and more powerful approach is to
> > >> > support transactions.
> > >> >
> > >> > *What does transaction mean?*
> > >> >
> > >> > A transaction means that a collection of records can be written
> > >> > transactionally within a stream or across multiple streams. The
> > >> > reader consumes them together once the transaction is committed, and
> > >> > never consumes them if the transaction is aborted.
> > >> >
> > >> > Transactions will provide the following guarantees:
> > >> >
> > >> > - The reader should not be exposed to records written from uncommitted
> > >> > transactions (mandatory)
> > >> > - The reader should consume the records in transaction commit order
> > >> > rather than in record write order (mandatory)
> > >> > - No duplicated records within a transaction (mandatory)
> > >> > - Allow interleaving transactional writes and non-transactional writes
> > >> > (optional)
> > >> >
> > >> > *Stream Transaction & Namespace Transaction*
> > >> >
> > >> > There will be two types of transaction: the stream-level transaction
> > >> > (local transaction) and the namespace-level transaction (global
> > >> > transaction).
> > >> >
> > >> > A stream-level transaction is a transactional operation that writes
> > >> > records to one stream; a namespace-level transaction is a
> > >> > transactional operation that writes records to multiple streams.
> > >> >
> > >> > *Implementation Thoughts*
> > >> >
> > >> > - A transaction consists of a begin control record, a series of data
> > >> > records and a commit/abort control record.
> > >> > - The begin/commit/abort control records are written to a `commit`
> > >> > log stream, while the data records are written to the normal data log
> > >> > streams.
> > >> > - For a stream-level transaction the `commit` log stream is the data
> > >> > log stream itself, while for namespace-level transactions it is a
> > >> > *system* stream (or multiple system streams).
> > >> > - The transaction code would look like the following:
> > >> >
> > >> > <code>
> > >> >
> > >> > // Proposed API sketch; stream names are shown as string literals
> > >> > // for illustration only.
> > >> > Transaction txn = client.transaction();
> > >> > Future<DLSN> result1 = txn.write("stream-0", record);
> > >> > Future<DLSN> result2 = txn.write("stream-1", record);
> > >> > Future<DLSN> result3 = txn.write("stream-2", record);
> > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >> >
> > >> > </code>
> > >> >
> > >> > If the txn is committed, all the write futures will be satisfied with
> > >> > their written DLSNs. If the txn is aborted, all the write futures will
> > >> > fail together. There is no partial failure state.
> > >> >
> > >> > - The actual data flow will be:
> > >> >
> > >> > 1. the writer gets a transaction id from the owner of the `commit` log
> > >> > stream
> > >> > 2. it writes the begin control record (synchronously) with the
> > >> > transaction id
> > >> > 3. each write within the same txn is assigned a local sequence number
> > >> > starting from 0. The combination of transaction id and local sequence
> > >> > number will be used later on by the readers to de-duplicate records.
> > >> > 4. the commit/abort control record is written based on the results
> > >> > from step 3.
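
[Inline note] The reader-side de-duplication in the data flow above can be
sketched as follows. This is only a minimal illustration, assuming the reader
can see the transaction id and local sequence number of each record.
`TxnRecordId` and `TxnDeduplicator` are hypothetical names, not existing DL
classes, and a real reader would also need to bound the set of ids it keeps.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical record identity: (transaction id, local sequence number). */
final class TxnRecordId {
    final long txnId;
    final int localSeq;

    TxnRecordId(long txnId, int localSeq) {
        this.txnId = txnId;
        this.localSeq = localSeq;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TxnRecordId)) return false;
        TxnRecordId other = (TxnRecordId) o;
        return txnId == other.txnId && localSeq == other.localSeq;
    }

    @Override
    public int hashCode() {
        return Objects.hash(txnId, localSeq);
    }
}

/** Hypothetical reader-side filter that drops records it has already seen. */
final class TxnDeduplicator {
    // Unbounded here for brevity; an assumption of this sketch only.
    private final Set<TxnRecordId> seen = new HashSet<>();

    /** Returns true the first time an id is offered, false for a retried duplicate. */
    boolean accept(long txnId, int localSeq) {
        return seen.add(new TxnRecordId(txnId, localSeq));
    }
}
```

A retried write carries the same (transaction id, local sequence number) pair,
so `accept` returns false for it and the reader can skip the record.
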
> > >> >
> > >> > - The application can supply a timeout for the transaction when it
> > >> > calls #begin(). The owner of the `commit` log stream can abort
> > >> > transactions that are not committed or aborted within their timeout.
> > >> >
> > >> > - Failures:
> > >> >
> > >> > * all the log record writes can simply be retried, as the records
> > >> > will be de-duplicated properly at the reader side.
> > >> >
> > >> > - Reader:
> > >> >
> > >> > * The reader can be configured to read uncommitted records or
> > >> > committed records only (by default it reads uncommitted records).
> > >> > * If the reader is configured to read committed records only, the
> > >> > read-ahead cache will be changed to maintain one additional map of
> > >> > pending committed records. The map is bounded, and records will be
> > >> > dropped as read-ahead moves forward.
> > >> > * When the reader hits a commit record, it will rewind to the begin
> > >> > record and start reading from there. By leveraging the read-ahead
> > >> > cache and the pending committed records cache, this should work well
> > >> > for both short and long transactions.
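
[Inline note] To illustrate the "committed records only" mode, here is a rough
sketch of a pending-records map. It is an assumption-laden simplification: it
models only the buffering side (hold records per transaction, release them on
commit, drop them on abort) and leaves out the rewind to the begin record and
the bounded eviction described above. `CommittedOnlyBuffer` and its methods
are hypothetical names, and payloads are plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical buffer that exposes records only after their transaction commits. */
final class CommittedOnlyBuffer {
    // txnId -> (localSeq -> payload); TreeMap keeps records in local sequence order.
    private final Map<Long, TreeMap<Integer, String>> pending = new HashMap<>();

    /** Buffers a data record; a retried duplicate with the same local sequence number is ignored. */
    void onData(long txnId, int localSeq, String payload) {
        pending.computeIfAbsent(txnId, id -> new TreeMap<>()).putIfAbsent(localSeq, payload);
    }

    /** Releases all records of the transaction, ordered by local sequence number. */
    List<String> onCommit(long txnId) {
        TreeMap<Integer, String> records = pending.remove(txnId);
        if (records == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(records.values());
    }

    /** Drops all records of an aborted transaction so they are never delivered. */
    void onAbort(long txnId) {
        pending.remove(txnId);
    }
}
```

Because records are released only from `onCommit`, the application sees
transactions in commit order, which matches the second guarantee listed
earlier.
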
> > >> >
> > >> > - DLSN, SequenceId:
> > >> >
> > >> > * We will add a fourth field to DLSN: the `local sequence number`
> > >> > within a transaction session. The new DLSN of a record in a
> > >> > transaction will be the DLSN of the commit control record plus the
> > >> > record's local sequence number.
> > >> > * The sequence id will still be the position of the commit record plus
> > >> > the record's local sequence number. On writing the commit control
> > >> > record, the position is advanced by the total number of records
> > >> > written in the transaction.
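
[Inline note] A possible shape for the four-field DLSN, purely as a sketch.
`TxnDlsn` is a hypothetical class, not the existing DLSN. The first three
fields follow my understanding of today's DLSN (log segment sequence number,
entry id, slot id), and I assume the local sequence number is simply the last
component in the ordering.

```java
/** Hypothetical four-field position: the commit record's DLSN plus a local sequence number. */
final class TxnDlsn implements Comparable<TxnDlsn> {
    final long logSegmentSequenceNo;
    final long entryId;
    final long slotId;
    final long localSeq; // assumed to be 0 for the commit record itself

    TxnDlsn(long logSegmentSequenceNo, long entryId, long slotId, long localSeq) {
        this.logSegmentSequenceNo = logSegmentSequenceNo;
        this.entryId = entryId;
        this.slotId = slotId;
        this.localSeq = localSeq;
    }

    /** Position of a data record: the commit record's position plus the record's local sequence number. */
    static TxnDlsn ofTxnRecord(TxnDlsn commit, long localSeq) {
        return new TxnDlsn(commit.logSegmentSequenceNo, commit.entryId, commit.slotId, localSeq);
    }

    /** Orders by commit position first, then by local sequence number within the transaction. */
    @Override
    public int compareTo(TxnDlsn o) {
        int c = Long.compare(logSegmentSequenceNo, o.logSegmentSequenceNo);
        if (c != 0) return c;
        c = Long.compare(entryId, o.entryId);
        if (c != 0) return c;
        c = Long.compare(slotId, o.slotId);
        if (c != 0) return c;
        return Long.compare(localSeq, o.localSeq);
    }
}
```

With this ordering, all records of one transaction sort next to each other at
the commit position, and records of a transaction committed later sort after
them regardless of when the data records were physically written.
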
> > >> >
> > >> > - Transaction Group & Namespace Transaction
> > >> >
> > >> > Using one single log stream for namespace transactions can become a
> > >> > bottleneck, since all the begin/commit/abort control records have to
> > >> > go through that one log stream.
> > >> >
> > >> > The idea of a 'transaction group' is to allow partitioning the writers
> > >> > into different transaction groups.
> > >> >
> > >> > Clients can specify the `group-name` when starting a transaction. If
> > >> > no `group-name` is specified, the default `commit` log in the
> > >> > namespace is used for creating transactions.
> > >> >
> > >> > -------------------------------------------------
> > >> >
> > >> > I'd like to collect feedback on this idea. Any comments are
> > >> > appreciated, and if anyone is also interested in this idea, we'd like
> > >> > to collaborate with the community.
> > >> >
> > >> >
> > >> > - Xi
> > >> >
> > >>
> > >
> > >
> >
>
