Andrew Purtell created HBASE-21831:
--------------------------------------
Summary: Optional store-and-forward of simple mutations for
regions in transition
Key: HBASE-21831
URL: https://issues.apache.org/jira/browse/HBASE-21831
Project: HBase
Issue Type: New Feature
Components: regionserver, rpc
Reporter: Andrew Purtell
Fix For: 3.0.0, 1.6.0, 2.3.0
We have an internal service built on Redis that is considering writing through
to HBase directly for its persistence needs. The service's current experience
with Redis is:
* Average write latency is ~milliseconds
* p999 write latencies are "a few seconds"
They want a similar experience when writing simple values directly to HBase.
Infrequent exceptions to this would be acceptable.
* Availability of 99.9% for writes
* Expect most writes to be serviced within a few milliseconds, e.g. a few
millis at p95. Still evaluating what the requirement should be (~millis at p90
vs p95 vs p99).
* Timeout of 2 seconds; timeouts should be rare
There is a fallback plan under consideration for when HBase cannot respond
within 2 seconds. However, this fallback cannot guarantee durability: Redis or
the service's daemons may go down. They want HBase to provide the required
durability.
Because this is a caching service, where all writes are expected to be served
again from the cache for at least a while, it would be acceptable for HBase to
accept writes that are not immediately visible, even if in the worst case they
do not become visible for 10-20 minutes. This is relatively easy to achieve as
an engineering target should we consider offering a write option that does not
guarantee immediate visibility. (A proposal follows below.) We are considering
store-and-forward of simple mutations and perhaps also simple deletes, although
the latter is not a hard requirement. Out-of-order processing of this subset of
mutation requests is acceptable because their data model ensures all values are
immutable. Presumably, on the HBase side, the timestamps of the requests would
be set to the current server wall clock time when received, so that when all
are eventually applied they are available with correct temporal ordering
(within the effective resolution of the server clocks). Deletes which are not
immediately applied (or failed) could cause application-level confusion, and
although this would remain a concern for the general case, for this specific
use case stale reads could be explained to and tolerated by their users.
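To make the temporal ordering point concrete, here is a minimal client-side
sketch (table, family, and row names are illustrative) showing that if
immutable values carry the timestamps at which they were accepted, the physical
order in which the edits are later applied does not matter; a read of the
latest version still returns the temporally newest cell:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OutOfOrderApplySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "cache_backing" is an illustrative table name, assumed to exist with family "d".
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("cache_backing"))) {
      byte[] row = Bytes.toBytes("key-1");
      byte[] cf = Bytes.toBytes("d");
      byte[] q = Bytes.toBytes("v");

      long t1 = 1000L; // timestamp at which the first write was accepted
      long t2 = 2000L; // timestamp at which the second write was accepted

      // Apply the later edit first, then the earlier one, simulating
      // out-of-order draining of a store-and-forward queue.
      table.put(new Put(row).addColumn(cf, q, t2, Bytes.toBytes("newer")));
      table.put(new Put(row).addColumn(cf, q, t1, Bytes.toBytes("older")));

      // The read still observes the temporally newest value, "newer".
      Result r = table.get(new Get(row));
      System.out.println(Bytes.toString(r.getValue(cf, q)));
    }
  }
}
{code}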
The BigTable architecture assigns at most one server to serve a region at a
time. Region replicas are an enhancement to the base BigTable architecture we
made in HBase which stands up additional read-only replicas (typically two) for
a given region, meaning a client attempting a read can fail over very quickly
from the primary to a replica for a (potentially stale) read, distribute read
load over all replicas, or employ a hedged reading strategy. Enabling region
replicas and timeline consistency can lower the availability gap for reads in
the high percentiles from ~minutes to ~milliseconds. However, this option will
not help write use cases wanting roughly the same thing, because there can be
no fail-over for writes. Writes must still go to the active primary. When that
region is in transition, writes must be held on the client until the region is
redeployed. And if region replicas are not enabled, when the sole deployment of
a region is in transition, writes again must be held on the client until the
region is available again.
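For comparison, the read-side relaxation we already offer looks roughly like
the sketch below, assuming a table with REGION_REPLICATION set greater than 1
(table and row names are illustrative). There is no equivalent option on the
write path, which is the gap this issue is about:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("cache_backing"))) {
      Get get = new Get(Bytes.toBytes("key-1"));
      // Allow the read to be served by a secondary replica if the primary
      // is slow or unavailable; the result may be stale.
      get.setConsistency(Consistency.TIMELINE);
      Result result = table.get(get);
      System.out.println("stale=" + result.isStale());
    }
  }
}
{code}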
Regions enter the in-transition state for two reasons: failures, and
housekeeping (splits and merges, or balancing). Time to region redeployment
after failures depends on a number of factors, like how long it took for us to
become aware of the failure, and how long it takes to split the write-ahead log
of the failed server and distribute the recovered edits to the reopening
region(s). We could in theory improve this behavior by being more predictive
about declaring failure, like employing a phi accrual failure detector to
signal to the master from clients that a regionserver is sick. Other
time-to-recovery issues and mitigations are discussed in a number of JIRAs and
blog posts and not discussed further here. Regarding housekeeping activities,
splits and merges typically complete in under a second. However, split times up
to ~30 seconds have been observed at my place of employ in rare conditions. In
the instances I have investigated the cause is I/O stalls on the datanodes and
metadata request stalls in the namenode, so not unexpected outlier cases.
Mitigating these risks involves looking at split and merge policies. Split and
merge policies are pluggable, and policy choices can be applied per table. In
extreme cases, auto-splitting (and auto-merging) can be disabled on
performance-sensitive tables and accomplished through manual means during scheduled
maintenance windows. Regions may also be moved by the Balancer to avoid
unbalanced loading over the available cluster resources. During balancing, one
or more regions are closed on some servers, then opened on others. While
closing, a region must flush all of its memstores, yet will not accept any new
requests during flushing, because it is closing. This can lead to short
availability gaps. The Balancer's strategy can be tuned, or, on clusters where
any disruption is undesirable, the balancer can be disabled and invoked
manually only during scheduled maintenance, either by admin API or by plugging
in a custom implementation that does nothing. _While these
options are available, they are needlessly complex to consider for use cases
that can be satisfied with simple dogged store-and-forward of mutations
accepted on behalf of an unavailable region_. _It would be far simpler from the
user perspective to offer a new flag for mutation requests._ It may also not be
tenable to apply the global configuration changes discussed above in this
paragraph to a multitenant cluster.
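For illustration, the per-table and cluster-wide knobs mentioned above can be
exercised roughly as in the sketch below, using branch-1 style admin APIs (the
table name is illustrative, and some method names differ in newer releases):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy;

public class HousekeepingKnobsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName tn = TableName.valueOf("cache_backing"); // illustrative table

      // Disable automatic splitting for one performance-sensitive table by
      // plugging in the no-op split policy; splits can still be requested
      // manually during a maintenance window.
      HTableDescriptor htd = admin.getTableDescriptor(tn);
      htd.setRegionSplitPolicyClassName(DisabledRegionSplitPolicy.class.getName());
      admin.modifyTable(tn, htd);

      // Disable the balancer cluster-wide; re-enable (or run it once) only
      // during scheduled maintenance.
      admin.setBalancerRunning(false, true);
    }
  }
}
{code}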
The requirement to always take writes even under partial failure conditions is
a prime motivator for the development of eventually consistent systems.
However, while those systems can accept writes under a wider range of failure
conditions than systems like HBase that strive for consistency, they cannot
guarantee
those writes are immediately available for reads. Far from it. The guarantees
about data availability and freshness are reduced or eliminated in eventually
consistent designs. Consistent semantics remain highly desirable even though we
have to make availability tradeoffs. Eventually consistent designs expose data
inconsistency issues to their applications, and this is a constant pain point
for even the best developers. We want to retain HBase's consistent semantics
and operational model for the vast majority of use cases. That said, we can
look at some changes that improve the apparent availability of an HBase cluster
for a subset of simple mutation requests, for use cases that want to relax some
guarantees for writes in a similar manner as we have done earlier for reads via
the read replica feature.
If we accept the requirement to always take writes whenever any server is
available, and there is no need to make them immediately visible, we can
introduce a new write request attribute that says "it is fine to accept this on
behalf of the now or future region holder, in a store-and-forward manner", for
a subset of possible write operations: Append and Increment require the server
to return the up-to-date result, so they are not possible. CheckAndXXX operations
likewise must be executed at the primary. Deletes could be dangerous to apply
out of order and so should not be accepted as a rule. Perhaps simple deletes
could be supported, if an additional safety valve is switched off in
configuration, but not DeleteColumn or DeleteFamily. Simple Puts and multi-puts
(batch Put[] or RowMutations) can be safely serviced. This can still satisfy
requirements for dogged persistence of simple writes and benefit a range of use
cases. Should the primary region be unavailable for accepting this subset of
writes, the client would contact another regionserver, any regionserver, with
the new operation flag set, and that regionserver would then accept the write
on behalf of the future holder of the in-transition region. (Technically, a
client could set this flag at any time for any reason.) Regionservers to which
writes are handed off must then efficiently and doggedly drain their
store-and-forward queue. This queue must be durable across process failures and
restarts. We can use existing regionserver WAL facilities to support this. This
would be similar in some ways to how cross cluster replication is implemented.
Edits are persisted to the WAL at the source, the WAL entries are later
enumerated and applied at the sink. The differences here are:
* Regionservers would accept edits for regions they are not currently
servicing.
* WAL replay must also handle these "martian" edits, adding them to the
store-and-forward queue of the regionserver recovering the WAL.
* Regionservers will queue such edits and apply them to the local cluster
instead of shipping them out; in other words, the local regionserver acts as a
replication sink, not a source.
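To sketch what the client-facing surface might look like, the existing
per-operation attribute mechanism could carry the new flag, as below. The
attribute name and payload are hypothetical; nothing here exists yet, and the
eventual API could just as well be a first-class setter on Mutation:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreAndForwardFlagSketch {
  // Hypothetical attribute name; not part of any released HBase API.
  private static final String ACCEPT_STORE_AND_FORWARD = "hbase.mutation.store-and-forward";

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("cache_backing"))) { // illustrative table
      Put put = new Put(Bytes.toBytes("key-1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("value"));
      // Mark this simple Put as eligible for store-and-forward acceptance by
      // any regionserver if the primary region is in transition.
      put.setAttribute(ACCEPT_STORE_AND_FORWARD, Bytes.toBytes(true));
      table.put(put);
    }
  }
}
{code}
A client library would set this only on simple Puts and batch Put[] or
RowMutations requests, matching the subset described above.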
There could be no guarantee on the eventual visibility of requests accepted in
this manner, and writes accepted into store-and-forward queues may be applied
out of order, although the timestamp component of HBase keys will ensure
correct temporal ordering (within the effective resolution of the server
clocks) after all are eventually applied. This is consistent with the semantics
one gets with eventually consistent systems. This would not be default
behavior, nor default semantics. HBase would continue to trade off for
consistency, unless this new feature/flag is enabled by an informed party. This
is consistent with the strategy we adopted for region replicas.
As with cross-cluster replication, we would want to provide metrics on the
depth of the queues and the maximum age of entries in these queues, so
operators can get a sense of how far behind they might be and can determine
compliance with application service level objectives.
Implementation of this feature should satisfy compatibility policy constraints
such that minor releases can accept it. At the very least we would require it
in a new branch-1 minor. This is a hard requirement.