[
https://issues.apache.org/jira/browse/HBASE-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang updated HBASE-21831:
------------------------------
Fix Version/s: 4.0.0-alpha-1
(was: 3.0.0-beta-2)
> Optional store-and-forward of simple mutations for regions in transition
> ------------------------------------------------------------------------
>
> Key: HBASE-21831
> URL: https://issues.apache.org/jira/browse/HBASE-21831
> Project: HBase
> Issue Type: New Feature
> Components: regionserver, rpc
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Fix For: 4.0.0-alpha-1
>
>
> We have an internal service built on Redis that is considering writing
> through to HBase directly for their persistence needs. Their current
> experience with Redis is
> * Average write latency is ~milliseconds
> * p999 write latencies are "a few seconds"
> They want a similar experience when writing simple values directly to HBase.
> Infrequent exceptions to this would be acceptable.
> * Availability of 99.9% for writes
> * Expect most writes to be serviced within a few milliseconds, e.g. a few
> millis at p95. Still evaluating what the requirement should be (~millis at
> p90 vs p95 vs p99).
> * Timeout of 2 seconds, should be rare
> There is a fallback plan considered if HBase cannot respond within 2 seconds.
> However, this fallback cannot guarantee durability. Redis or the service's
> daemons may go down. They want HBase to provide required durability.
> Because this is a caching service, where all writes are expected to be served
> again from cache, at least for a while, if HBase were to accept writes such
> that they are not immediately visible, it could be fine that they are not
> visible for 10-20 minutes in the worst case. This is relatively easy to
> achieve as an engineering target should we consider offering a write option
> that does not guarantee immediate visibility. (A proposal follows below.) We
> are considering store-and-forward of simple mutations and perhaps also simple
> deletes, although the latter is not a hard requirement. Out of order
> processing of this subset of mutation requests is acceptable because their
> data model ensures all values are immutable. Presumably on the HBase side the
> timestamps of the requests would be set to the current server wall clock time
> when received, so eventually when applied all are available with correct
> temporal ordering (within the effective resolution of the server clocks).
> Deletes which are not immediately applied (or failed) could cause application
> level confusion, and although this would remain a concern for the general
> case, for this specific use case, stale reads could be explained to and
> tolerated by their users.
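> To make the timestamping point concrete: a Put that carries no explicit timestamp is stamped with the serving regionserver's wall clock time when it is committed, which is the behavior the eventual temporal ordering above relies on. A minimal client-side sketch; the table, family, and row names are purely illustrative:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.client.Table;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class ServerTimestampedPut {
>   public static void main(String[] args) throws IOException {
>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>          Table table = conn.getTable(TableName.valueOf("cache"))) {
>       Put put = new Put(Bytes.toBytes("k1"));
>       // No explicit timestamp on the cell: the serving regionserver assigns its
>       // current wall clock time when the mutation is committed.
>       put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("payload"), Bytes.toBytes("immutable-value"));
>       table.put(put);
>     }
>   }
> }
> {code}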
> The BigTable architecture assigns at most one server to serve a region at a
> time. Region Replicas are an enhancement to the base BigTable architecture we
> made in HBase which stands up two more read-only replicas for a given region,
> meaning a client attempting a read has the option to fail over very quickly
> from the primary to a replica for a (potentially stale) read, or distribute
> read load over all replicas, or employ a hedged reading strategy. Enabling
> region replicas and timeline consistency can lower the availability gap for
> reads in the high percentiles from ~minutes to ~milliseconds. However, this
> option will not help for write use cases wanting roughly the same thing,
> because there can be no fail-over for writes. Writes must still go to the
> active primary. When that region is in transition, writes must be held on the
> client until it is redeployed. Or, if region replicas are not enabled, when
> the sole region is in transition, again, writes must be held on the client
> until the region is available again.
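> For contrast, the read-side relaxation that region replicas already provide is opted into per request with timeline consistency, roughly as in the fragment below (row key and table handle are illustrative):
> {code:java}
> import org.apache.hadoop.hbase.client.Consistency;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.util.Bytes;
>
> // 'table' is an open org.apache.hadoop.hbase.client.Table handle.
> Get get = new Get(Bytes.toBytes("k1"));
> // Allow the read to be served by a secondary replica if the primary is slow
> // or unavailable; the result may be stale.
> get.setConsistency(Consistency.TIMELINE);
> Result result = table.get(get);
> boolean stale = result.isStale(); // true if a secondary replica served the read
> {code}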
> Regions enter the in-transition state for two reasons: failures, and
> housekeeping (splits and merges, or balancing). Time to region redeployment
> after failures depends on a number of factors, like how long it took for us
> to become aware of the failure, and how long it takes to split the
> write-ahead log of the failed server and distribute the recovered edits to
> the reopening region(s). We could in theory improve this behavior by being
> more predictive about declaring failure, for example by employing a phi accrual
> failure detector so that clients can signal to the master that a regionserver
> is sick.
> Other time-to-recovery issues and mitigations are discussed in a number of
> JIRAs and blog posts and not discussed further here. Regarding housekeeping
> activities, splits and merges typically complete in under a second. However,
> split times up to ~30 seconds have been observed at my place of employ in
> rare conditions. In the instances I have investigated the cause is I/O stalls
> on the datanodes and metadata request stalls in the namenode, so not
> unexpected outlier cases. Mitigating these risks involves looking at split and
> merge policies. Split and merge policies are pluggable, and policy choices can be
> applied per table. In extreme cases, auto-splitting (and auto-merging) can be
> disabled on performance sensitive tables and accomplished through manual
> means during scheduled maintenance windows. Regions may also be moved by the
> Balancer to avoid unbalanced loading over the available cluster resources.
> During balancing, one or more regions are closed on some servers, then opened
> on others. While closing, a region must flush all of its memstores, and because
> it is closing it will not accept any new requests during the flush. This can
> lead to short availability gaps. The Balancer's strategy can be tuned, or on
> clusters where any disruption is undesirable, the balancer can be disabled,
> and enabled/invoked manually only during scheduled maintenance either by
> admin API or by plugging in a custom implementation that does nothing. _While
> these options are available, they are needlessly complex to consider for use
> cases that can be satisfied with simple dogged store-and-forward of mutations
> accepted on behalf of an unavailable region_. _It would be far simpler from
> the user perspective to offer a new flag for mutation requests._ It may also
> not be tenable to apply the global configuration changes discussed above in
> this paragraph to a multitenant cluster.
> The requirement to always take writes even under partial failure conditions
> is a prime motivator for the development of eventually consistent systems.
> However, while those systems can accept writes under a wider range of failure
> conditions than systems like HBase that strive for consistency, they cannot
> guarantee those writes are immediately available for reads. Far from it. The
> guarantees about data availability and freshness are reduced or eliminated in
> eventually consistent designs. Consistent semantics remain highly desirable
> even though we have to make availability tradeoffs. Eventually consistent
> designs expose data inconsistency issues to their applications, and this is a
> constant pain point for even the best developers. We want to retain HBase's
> consistent semantics and operational model for the vast majority of use
> cases. That said, we can look at some changes that improve the apparent
> availability of an HBase cluster for a subset of simple mutation requests,
> for use cases that want to relax some guarantees for writes in a similar
> manner as we have done earlier for reads via the read replica feature.
> If we accept the requirement to always take writes whenever any server is
> available, and there is no need to make them immediately visible, we can
> introduce a new write request attribute that says "it is fine to accept this
> on behalf of the now or future region holder, in a store-and-forward manner",
> for a subset of possible write operations: Append and Increment require the
> server to return the up-to-date result, so they are not possible. CheckAndXXX
> operations likewise must be executed at the primary. Deletes could be
> dangerous to apply out of order and so should not be accepted as a rule.
> Perhaps simple deletes could be supported, if an additional safety valve is
> switched off in configuration, but not DeleteColumn or DeleteFamily. Simple
> Puts and multi-puts (batch Put[] or RowMutations) can be safely serviced.
> This can still satisfy requirements for dogged persistence of simple writes
> and benefit a range of use cases. Should the primary region be unavailable
> for accepting this subset of writes, the client would contact another
> regionserver, any regionserver, with the new operation flag set, and that
> regionserver would then accept the write on behalf of the future holder of
> the in-transition region. (Technically, a client could set this flag at any
> time for any reason.) Regionservers to which writes are handed off must then
> efficiently and doggedly drain their store-and-forward queue. This queue must
> be durable across process failures and restarts. We can use existing
> regionserver WAL facilities to support this. This would be similar in some
> ways to how cross cluster replication is implemented. Edits are persisted to
> the WAL at the source, the WAL entries are later enumerated and applied at
> the sink. The differences here are:
> * Regionservers would accept edits for regions they are not currently
> servicing.
> * WAL replay must also handle these "martian" edits, adding them to the
> store-and-forward queue of the regionserver recovering the WAL.
> * Regionservers will queue such edits and apply them to the local cluster
> instead of shipping them out; in other words, the local regionserver acts as
> a replication sink, not a source.
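> A rough sketch of what the client-side opt-in could look like, assuming the new flag is carried as an ordinary mutation attribute; the attribute name and value encoding here are purely hypothetical, not a settled design:
> {code:java}
> import org.apache.hadoop.hbase.client.Put;
> import org.apache.hadoop.hbase.util.Bytes;
>
> // Hypothetical attribute name; nothing in this proposal fixes the wire format yet.
> final String STORE_AND_FORWARD_ATTR = "hbase.mutation.store.and.forward";
>
> // 'table' is an open org.apache.hadoop.hbase.client.Table handle.
> Put put = new Put(Bytes.toBytes("k1"));
> put.addColumn(Bytes.toBytes("v"), Bytes.toBytes("payload"), Bytes.toBytes("immutable-value"));
> // Proposed semantics: any regionserver may accept this write on behalf of the
> // current or future holder of an in-transition region and forward it later.
> put.setAttribute(STORE_AND_FORWARD_ATTR, new byte[] { 1 });
> table.put(put);
> {code}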
> There could be no guarantee on how quickly requests accepted in this manner
> become visible, and writes accepted into store-and-forward queues may be
> applied out of order, although the timestamp component of HBase keys will
> ensure correct temporal ordering (within the effective resolution of the
> server clocks) after all are eventually applied. This is consistent with the
> semantics one gets with eventually consistent systems. This would not be
> default behavior, nor default semantics. HBase would continue to trade off
> for consistency, unless this new feature/flag is enabled by an informed
> party. This is consistent with the strategy we adopted for region replicas.
>
> As with cross-cluster replication, we would want to provide metrics on the
> depth of the queues and maximum age of entries in these queues, so operators
> can get a sense of how far behind they might be, to determine compliance with
> application service level objectives.
> Implementation of this feature should satisfy compatibility policy
> constraints such that minor releases can accept it. At the very least we
> would require it in a new branch-1 minor. This is a hard requirement.
> Concerns about ordering of value versions and application of operation
> precedence rules (e.g. deletes before puts) within a single clock tick can be
> mitigated by an additional, relatively small change: We can ensure that only
> one operation per row can be committed per clock tick. (The row is our unit
> of atomicity and the scope where monotonicity in timestamp assignment will be
> useful.) In the normal case the overhead is only an extra long for tracking
> the last commit time. We already need to get the current time for other
> reasons. In the edge case we must spin until the clock ticks over. The impact
> on throughput and CPU will depend on how often we hit this case, but it is
> expected to be rare. Operators should also ensure, through monitoring of
> clock drift over the fleet, that no clock is so far ahead that region
> failover from one server to another will defeat the monotonicity gained by
> this strategy.
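> The mechanism could be as small as the following standalone sketch (an illustration only, not actual HBase internals); in practice the last commit time would be tracked at whatever granularity the existing row locking provides:
> {code:java}
> /**
>  * Illustration of "at most one commit per row per clock tick": remember the
>  * last commit timestamp handed out and, in the rare edge case, spin until the
>  * wall clock has advanced before handing out the next one.
>  */
> final class RowCommitClock {
>   private long lastCommitTs = -1L;
>
>   /** Returns a commit timestamp strictly greater than the previous one. */
>   synchronized long nextCommitTimestamp() {
>     long now = System.currentTimeMillis();
>     while (now <= lastCommitTs) {
>       // Spin until the clock ticks over; expected to be rare.
>       now = System.currentTimeMillis();
>     }
>     lastCommitTs = now;
>     return now;
>   }
> }
> {code}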
--
This message was sent by Atlassian Jira
(v8.20.10#820010)