> Ah- yes good point (to be clear we're not using the proxy this way today).

> > Due to the source of the
> > data (HBase Replication), we cannot guarantee that a single partition will
> > be owned for writes by the same client.

> Do you mean you *need* to support multiple writers issuing interleaved
> writes or is it just that they might sometimes interleave writes and you
> don't care?
How HBase partitions the keys being written wouldn't have a one-to-one
mapping with the logs we would have in DistributedLog. Even if we did have
that alignment when the cluster first started, HBase will rebalance which
servers own which partitions, as well as split and merge partitions that
already exist, causing eventual drift away from one log per partition.
Because we want ordering guarantees per key (row in HBase), we partition
the logs by key. Since multiple writers are possible per range of keys
(due to the aforementioned rebalancing / splitting / etc. in HBase), we
cannot use the core library, since it requires a single writer for
ordering.
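
To make the key-based partitioning concrete, here is a minimal sketch of
the routing we have in mind (our own illustration, not DistributedLog API;
the stream-name scheme and NUM_LOGS are assumptions). The mapping is a pure
function of the row key, so every replication sink computes the same
key -> log assignment regardless of which region server currently owns the
key:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public final class KeyToLogRouter {

    // Assumed: a fixed number of logs, independent of how HBase splits regions.
    private static final int NUM_LOGS = 64;

    /** Deterministically map an HBase row key to a log stream name, so
     *  per-key ordering is preserved even though several writers may end up
     *  sharing the same log. */
    public static String logNameFor(byte[] rowKey) {
        CRC32 crc = new CRC32();
        crc.update(rowKey);
        long bucket = crc.getValue() % NUM_LOGS;
        return "hbase-replication-" + bucket;
    }

    public static void main(String[] args) {
        byte[] key = "row-0042".getBytes(StandardCharsets.UTF_8);
        // The same key always routes to the same log.
        System.out.println(logNameFor(key));
    }
}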

But, for a single log, we don't really care about ordering beyond the
per-key level. So all we really need to handle is preventing duplicates
when a failure occurs, and ordering consistency across requests from a
single client.

So our general requirements are (for two writes A and B, timeline A -> B,
where request B is only made after A has successfully returned, possibly
after retries):

1) If the write succeeds, it will be durably exposed to clients within some
bounded time frame
2) If A succeeds and B succeeds, the ordering for the log will be A and
then B
3) If A fails due to an error that can be relied on to *not* be a lost ack
problem, it will never be exposed to the client, so it may (depending on
the error) be retried immediately
4) If A fails due to an error that could be a lost-ack problem (network
connectivity / etc.), within a bounded time frame it should be possible to
find out whether the write succeeded or failed, either by reading from some
checkpoint of the log for the changes that should have been made, or via
some other server-side support.
5) If A is turned into multiple batches (one large request gets split into
multiple smaller ones to the bookkeeper backend, due to log rolling / size
/ etc):
  a) The ordering of entries within a batch is consistent with the original
request when exposed in the log (though they may be interleaved with other
requests)
  b) The ordering across batches is consistent with the original request
when exposed in the log (though they may be interleaved with other
requests)
  c) If a batch fails, and cannot be retried / is unsuccessfully retried,
all batches after the failed batch should not be exposed in the log. Note:
the batches before and including the failed batch that ended up succeeding
can show up in the log, again within some bounded time range for reads by
a client.

Since we can guarantee per-key ordering on the client side, we guarantee
that there is a single writer per key, just not per log. So if there were a
way to find out, within a certain time frame, whether a single write
request was written or not (since failures should be rare anyway, this is
fine even if it is expensive), we can then have the client guarantee the
ordering it needs.
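
As a rough sketch of what we'd build on top of that (requirement 4 above),
using made-up interfaces rather than the real DistributedLog API: on an
ambiguous failure, read forward from the last position we know was written,
look for the client-assigned sequence number of the in-doubt write, and
retry only if it isn't found:

import java.util.Optional;

interface LogRecordView {
    long clientSequence();        // sequence number we embed in each payload
}

interface CheckpointReader {
    // Returns the next record after the checkpoint, or empty at the current tail.
    Optional<LogRecordView> readNext();
}

final class AmbiguousWriteResolver {

    /** After a possible lost ack, scan from the last known-good checkpoint
     *  and report whether the in-doubt write already made it into the log. */
    static boolean alreadyWritten(CheckpointReader reader, long inDoubtSequence) {
        Optional<LogRecordView> next;
        while ((next = reader.readNext()).isPresent()) {
            if (next.get().clientSequence() == inDoubtSequence) {
                return true;      // lost ack: the write landed, retrying would duplicate it
            }
        }
        return false;             // not found up to the tail: safe to retry the write
    }
}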


> Cameron:
> Another thing we've discussed but haven't really thought through -
> We might be able to support some kind of epoch write request, where the
> epoch is guaranteed to have changed if the writer has changed or the
> ledger was ever fenced off. Writes include an epoch and are rejected if
> the epoch has changed.
> With a mechanism like this, fencing the ledger off after a failure would
> ensure any pending writes had either been written or would be rejected.

The issue is how I guarantee that the write I sent to the server was
actually written. Since a network issue could happen on the send of the
request, or on the receipt of the success response, an epoch wouldn't tell
me whether I can safely retry: the write could have been successfully
written, but AWS dropped the connection carrying the success response.
Since the epoch would be the same (same ledger), I could write duplicates.
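
To spell that scenario out, a toy illustration (not real proxy code) of why
an epoch check alone can't reject the retry when the writer and ledger are
unchanged:

import java.util.ArrayList;
import java.util.List;

final class EpochOnlyDedupProblem {

    static final long SERVER_EPOCH = 7;   // unchanged: same writer, same ledger
    static final List<String> log = new ArrayList<>();

    /** Server side: reject only if the epoch changed; it cannot see duplicates. */
    static boolean write(long epoch, String record) {
        if (epoch != SERVER_EPOCH) {
            return false;
        }
        log.add(record);
        return true;                       // this ack may be lost on the way back
    }

    public static void main(String[] args) {
        write(SERVER_EPOCH, "A");          // succeeds, but the ack never arrives
        write(SERVER_EPOCH, "A");          // client retries with the same epoch
        System.out.println(log);           // [A, A] -- the duplicate described above
    }
}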


> We are currently proposing adding a transaction semantic to dl to get rid
> of the size limitation and the unaware-ness in the proxy client. Here is
> our idea -
> http://mail-archives.apache.org/mod_mbox/incubator-distributedlog-dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYml4nxsybyiofzxpv...@mail.gmail.com%3e

> I am not sure if your idea is similar as ours. but we'd like to
> collaborate with the community if anyone has the similar idea.

Our use case would be covered by transaction support, but I'm unsure
whether we need something that heavyweight for the guarantees we need.


Basically, the high-level requirement here is "support consistent write
ordering for single-writer-per-key, multi-writer-per-log". My hunch is
that, with some added guarantees in the proxy (if they aren't already
supported), plus some custom client code on our side that removes, from a
failed request, the entries that actually did get written to
DistributedLog, it should be a relatively easy thing to support.
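
For that custom client piece, a minimal sketch with made-up types: given
the records of a failed batch and the set of client sequence numbers that a
read-back shows were actually persisted, keep only the unwritten records
for the retry, so we avoid duplicates without breaking the original order:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class RetryBatchTrimmer {

    static final class PendingRecord {
        final long clientSequence;
        final byte[] payload;

        PendingRecord(long clientSequence, byte[] payload) {
            this.clientSequence = clientSequence;
            this.payload = payload;
        }
    }

    /** Drop every record the read-back proved is already in the log; what is
     *  left is retried in the original order. */
    static List<PendingRecord> unwrittenRecords(List<PendingRecord> failedBatch,
                                                Set<Long> persistedSequences) {
        List<PendingRecord> toRetry = new ArrayList<>();
        for (PendingRecord record : failedBatch) {
            if (!persistedSequences.contains(record.clientSequence)) {
                toRetry.add(record);
            }
        }
        return toRetry;
    }
}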

Thanks,
Cameron


On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart <lstew...@twitter.com.invalid>
wrote:

> Cameron:
> Another thing we've discussed but haven't really thought through -
> We might be able to support some kind of epoch write request, where the
> epoch is guaranteed to have changed if the writer has changed or the ledger
> was ever fenced off. Writes include an epoch and are rejected if the epoch
> has changed.
> With a mechanism like this, fencing the ledger off after a failure would
> ensure any pending writes had either been written or would be rejected.
>
>
> On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <si...@apache.org> wrote:
>
> > Cameron,
> >
> > I think both Leigh and Xi had made a few good points about your question.
> >
> > To add one more point to your question - "but I am not
> > 100% of how all of the futures in the code handle failures.
> > If not, where in the code would be the relevant places to add the ability
> > to do this, and would the project be interested in a pull request?"
> >
> > The current proxy and client logic doesn't do perfectly on handling
> > failures (duplicates) - the strategy now is the client will retry as best
> > at it can before throwing exceptions to users. The code you are looking
> for
> > - it is on BKLogSegmentWriter for the proxy handling writes and it is on
> > DistributedLogClientImpl for the proxy client handling responses from
> > proxies. Does this help you?
> >
> > And also, you are welcome to contribute the pull requests.
> >
> > - Sijie
> >
> >
> >
> > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <kin...@gmail.com>
> wrote:
> >
> > > I have a question about the Proxy Client. Basically, for our use cases,
> > we
> > > want to guarantee ordering at the key level, irrespective of the
> ordering
> > > of the partition it may be assigned to as a whole. Due to the source of
> > the
> > > data (HBase Replication), we cannot guarantee that a single partition
> > will
> > > be owned for writes by the same client. This means the proxy client
> works
> > > well (since we don't care which proxy owns the partition we are writing
> > > to).
> > >
> > >
> > > However, the guarantees we need when writing a batch consists of:
> > > Definition of a Batch: The set of records sent to the writeBatch
> endpoint
> > > on the proxy
> > >
> > > 1. Batch success: If the client receives a success from the proxy, then
> > > that batch is successfully written
> > >
> > > 2. Inter-Batch ordering : Once a batch has been written successfully by
> > the
> > > client, when another batch is written, it will be guaranteed to be
> > ordered
> > > after the last batch (if it is the same stream).
> > >
> > > 3. Intra-Batch ordering: Within a batch of writes, the records will be
> > > committed in order
> > >
> > > 4. Intra-Batch failure ordering: If an individual record fails to write
> > > within a batch, all records after that record will not be written.
> > >
> > > 5. Batch Commit: Guarantee that if a batch returns a success, it will
> be
> > > written
> > >
> > > 6. Read-after-write: Once a batch is committed, within a limited
> > time-frame
> > > it will be able to be read. This is required in the case of failure, so
> > > that the client can see what actually got committed. I believe the
> > > time-frame part could be removed if the client can send in the same
> > > sequence number that was written previously, since it would then fail
> and
> > > we would know that a read needs to occur.
> > >
> > >
> > > So, my basic question is if this is currently possible in the proxy? I
> > > don't believe it gives these guarantees as it stands today, but I am
> not
> > > 100% of how all of the futures in the code handle failures.
> > > If not, where in the code would be the relevant places to add the
> ability
> > > to do this, and would the project be interested in a pull request?
> > >
> > >
> > > Thanks,
> > > Cameron
> > >
> >
>
