Andrew Purtell created HBASE-21831:
--------------------------------------

             Summary: Optional store-and-forward of simple mutations for 
regions in transition
                 Key: HBASE-21831
                 URL: https://issues.apache.org/jira/browse/HBASE-21831
             Project: HBase
          Issue Type: New Feature
          Components: regionserver, rpc
            Reporter: Andrew Purtell
             Fix For: 3.0.0, 1.6.0, 2.3.0


We have an internal service built on Redis that is considering writing through 
to HBase directly for their persistence needs. Their current experience with 
Redis is
 * Average write latency is ~milliseconds
 * p999 write latencies are "a few seconds"

They want a similar experience when writing simple values directly to HBase. 
Infrequent exceptions to this would be acceptable. 
 * Availability of 99.9% for writes
 * Expect most writes to be serviced within a few milliseconds, e.g. a few 
millis at p95. They are still evaluating what the requirement should be (~millis 
at p90 vs p95 vs p99).
 * Timeout of 2 seconds; hitting it should be rare

A fallback plan is under consideration for cases where HBase cannot respond 
within 2 seconds. However, this fallback cannot guarantee durability; Redis or 
the service's daemons may go down. They want HBase to provide the required 
durability.

Because this is a caching service, where all writes are expected to be served 
again from cache for at least a while, it could be acceptable for HBase to 
accept writes that are not immediately visible, even if they remain invisible 
for 10-20 minutes in the worst case. This is a relatively easy engineering 
target to hit should we consider offering a write option that does not 
guarantee immediate visibility. (A proposal follows below.) We are considering 
store-and-forward of simple mutations and perhaps also simple deletes, although 
the latter is not a hard requirement. Out-of-order processing of this subset of 
mutation requests is acceptable because their data model ensures all values are 
immutable. Presumably, on the HBase side, the timestamps of the requests would 
be set to the current server wall clock time when received, so that when 
eventually applied all are available with correct temporal ordering (within the 
effective resolution of the server clocks). Deletes which are neither applied 
nor failed promptly could cause application-level confusion, and although this 
would remain a concern for the general case, for this specific use case stale 
reads could be explained to and tolerated by their users.
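
For illustration only, a minimal sketch, in current client API terms, of the 
kind of accept-time stamping described above. The class and method names are 
made up, and the store-and-forward path that would invoke something like this 
does not exist yet.

{code:java}
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.EnvironmentEdgeManager;

public class AcceptTimeStamping {
  // Sketch only: stamp all cells of the Put with the wall clock time at which
  // the request was accepted, not the time at which it is eventually applied,
  // so that temporal ordering reflects accept time even after delayed replay.
  static Put stampAtAcceptTime(byte[] row, byte[] family, byte[] qualifier,
      byte[] value) {
    long acceptTime = EnvironmentEdgeManager.currentTime();
    return new Put(row, acceptTime).addColumn(family, qualifier, value);
  }
}
{code}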

The BigTable architecture assigns at most one server to serve a region at a 
time. Region replicas are an enhancement we made in HBase to the base BigTable 
architecture which stands up two more read-only replicas for a given region, 
so a client attempting a read has the option to fail over very quickly from 
the primary to a replica for a (potentially stale) read, distribute read load 
over all replicas, or employ a hedged reading strategy. Enabling
region replicas and timeline consistency can lower the availability gap for 
reads in the high percentiles from ~minutes to ~milliseconds. However, this 
option will not help for write use cases wanting roughly the same thing, 
because there can be no fail-over for writes. Writes must still go to the 
active primary. When that region is in transition, writes must be held on the 
client until it is redeployed. Or, if region replicas are not enabled, when the 
sole region is in transition, again, writes must be held on the client until 
the region is available again.
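
For comparison, a minimal sketch of the existing read-side relaxation from the 
client's point of view: a timeline-consistent read that may be served by a 
secondary replica while the primary is in transition. The class name is made 
up; Consistency.TIMELINE is the existing region replica client API.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class TimelineRead {
  // Sketch only: opt a single read into TIMELINE consistency so it can be
  // served by a read replica if the primary is slow or unavailable.
  static Result timelineGet(Table table, byte[] row) throws IOException {
    Get get = new Get(row);
    get.setConsistency(Consistency.TIMELINE);
    Result result = table.get(get);
    if (result.isStale()) {
      // Served by a secondary replica; the data may lag the primary.
    }
    return result;
  }
}
{code}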

Regions enter the in-transition state for two reasons: failures, and 
housekeeping (splits and merges, or balancing). Time to region redeployment 
after failures depends on a number of factors, like how long it took for us to 
become aware of the failure, and how long it takes to split the write-ahead log 
of the failed server and distribute the recovered edits to the reopening 
region(s). We could in theory improve this behavior by being more predictive 
about declaring failure, such as employing a phi accrual failure detector so 
that clients can signal to the master that a regionserver is sick. Other 
time-to-recovery issues and mitigations are discussed in a number of JIRAs and 
blog posts and are not discussed further here.

Regarding housekeeping activities, 
splits and merges typically complete in under a second. However, split times up 
to ~30 seconds have been observed at my place of employ in rare conditions. In 
the instances I have investigated, the cause was I/O stalls on the datanodes and 
metadata request stalls in the namenode, so these are not unexpected outlier 
cases. Mitigating these risks involves looking at split and merge policies, 
which are pluggable and can be chosen per table. In extreme 
cases, auto-splitting (and auto-merging) can be disabled on performance 
sensitive tables and accomplished through manual means during scheduled 
maintenance windows.

Regions may also be moved by the Balancer to avoid 
unbalanced loading over the available cluster resources. During balancing, one 
or more regions are closed on some servers, then opened on others. While 
closing, a region must flush all of its memstores while no longer accepting any 
new requests. This can lead to short 
availability gaps. The Balancer's strategy can be tuned, or on clusters where 
any disruption is undesirable, the balancer can be disabled, and 
enabled/invoked manually only during scheduled maintenance either by admin API 
or by plugging in a custom implementation that does nothing. _While these 
options are available, they are needlessly complex to consider for use cases 
that can be satisfied with simple dogged store-and-forward of mutations 
accepted on behalf of an unavailable region_. _It would be far simpler from the 
user perspective to offer a new flag for mutation requests._ It may also not be 
tenable to apply the global configuration changes discussed above in this 
paragraph to a multitenant cluster. 

The requirement to always take writes even under partial failure conditions is 
a prime motivator for the development of eventually consistent systems. However, 
while those systems can accept writes under a wider range of failure conditions 
than systems like HBase that strive for consistency, they cannot guarantee 
those writes are immediately available for reads. Far from it. The guarantees 
about data availability and freshness are reduced or eliminated in eventually 
consistent designs. Consistent semantics remain highly desirable even though we 
have to make availability tradeoffs. Eventually consistent designs expose data 
inconsistency issues to their applications, and this is a constant pain point 
for even the best developers. We want to retain HBase's consistent semantics 
and operational model for the vast majority of use cases. That said, we can 
look at some changes that improve the apparent availability of an HBase cluster 
for a subset of simple mutation requests, for use cases that want to relax some 
guarantees for writes in a similar manner as we have done earlier for reads via 
the read replica feature.

If we accept the requirement to always take writes whenever any server is 
available, and there is no need to make them immediately visible, we can 
introduce a new write request attribute that says "it is fine to accept this on 
behalf of the now or future region holder, in a store-and-forward manner", for 
a subset of possible write operations: Append and Increment require the server 
to return the up-to-date result, so they are not possible. CheckAndXXX operations 
likewise must be executed at the primary. Deletes could be dangerous to apply 
out of order and so should not be accepted as a rule. Perhaps simple deletes 
could be supported, if an additional safety valve is switched off in 
configuration, but not DeleteColumn or DeleteFamily. Simple Puts and multi-puts 
(batch Put[] or RowMutations) can be safely serviced. This can still satisfy 
requirements for dogged persistence of simple writes and benefit a range of use 
cases. Should the primary region be unavailable for accepting this subset of 
writes, the client would contact another regionserver, any regionserver, with 
the new operation flag set, and that regionserver would then accept the write 
on behalf of the future holder of the in-transition region. (Technically, a 
client could set this flag at any time for any reason.) Regionservers to which 
writes are handed off must then efficiently and doggedly drain their 
store-and-forward queue. This queue must be durable across process failures and 
restarts. We can use existing regionserver WAL facilities to support this. This 
would be similar in some ways to how cross-cluster replication is implemented: 
edits are persisted to the WAL at the source, and the WAL entries are later 
enumerated and applied at the sink. The differences here are:
 * Regionservers would accept edits for regions they are not currently 
servicing.
 * WAL replay must also handle these "martian" edits, adding them to the 
store-and-forward queue of the regionserver recovering the WAL.
 * Regionservers will queue such edits and apply them to the local cluster 
instead of shipping them out; in other words, the local regionserver acts as a 
replication sink, not a source.
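
From the client's perspective, the proposal could look something like the sketch 
below. The attribute key and the regionserver-side handling are hypothetical; 
nothing here exists today, and it is shown only to suggest the shape of the API.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreAndForwardPut {
  // Hypothetical attribute key; the feature proposed in this issue would define
  // the real name and the regionserver-side behavior.
  static final String STORE_AND_FORWARD_ATTR = "hbase.mutation.store.and.forward";

  // Sketch only: mark a simple Put as acceptable for store-and-forward, meaning
  // any regionserver may accept it on behalf of the now or future region holder.
  static void putWithStoreAndForward(Table table, byte[] row, byte[] family,
      byte[] qualifier, byte[] value) throws IOException {
    Put put = new Put(row).addColumn(family, qualifier, value);
    put.setAttribute(STORE_AND_FORWARD_ATTR, Bytes.toBytes(true));
    table.put(put);
  }
}
{code}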

There could be no guarantee on when requests accepted in this manner eventually 
become visible, and writes accepted into store-and-forward queues may be applied 
out of order, although the timestamp component of HBase keys will ensure 
correct temporal ordering (within the effective resolution of the server 
clocks) after all are eventually applied. This is consistent with the semantics 
one gets with eventually consistent systems. This would not be default 
behavior, nor default semantics. HBase would continue to trade off for 
consistency, unless this new feature/flag is enabled by an informed party. This 
is consistent with the strategy we adopted for region replicas.
 
As with cross-cluster replication, we would want to provide metrics on the 
depth of these queues and the maximum age of their entries, so operators can 
get a sense of how far behind they might be, to determine compliance with 
application service level objectives.

Implementation of this feature should satisfy compatibility policy constraints 
such that minor releases can accept it. At the very least we would require it 
in a new branch-1 minor. This is a hard requirement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
