[ https://issues.apache.org/jira/browse/HBASE-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358203#comment-17358203 ]

Andrew Kyle Purtell edited comment on HBASE-24440 at 6/6/21, 6:42 PM:
----------------------------------------------------------------------

In HBASE-25913 I considered serializing commits within a given clock tick 
using the system clock, by extending the EnvironmentEdge interface, which 
already deals with time. Consider an API for getting a Clock, by name, from a 
concurrent map of such clocks, and then using that clock to get _advancing 
time_: the current system time, like System.currentTimeMillis, but with an 
additional semantic. When you call getCurrentAdvancing() you will always get a 
value that has advanced forward from the last value anyone observed from that 
particular clock. A Clock implementation is free to maintain that invariant 
with different strategies, and I explored several different implementations (a 
rough sketch follows the list below):
- IncrementAdvancingClock: like a poor man's hybrid clock, if the system time 
hasn't advanced, manually advance it, and ensure other API calls to the same 
clock don't give a different, earlier time
- SpinAdvancingClock: if the system time hasn't advanced, spin until it does
- YieldAdvancingClock: if the system time hasn't advanced, yield the thread 
with a small sleep until it does
- BoundedIncrementSpinAdvancingClock: advance ahead of system time up to some 
bound (like 1 second) but then spin until the system time catches up
- BoundedIncrementYieldAdvancingClock: advance ahead of system time up to some 
bound (like 1 second) but then yield with a small sleep until the system time 
catches up
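
To make the shape of this concrete, here is a minimal sketch of what the clock 
registry and the simplest increment strategy could look like. The Clock 
interface, getCurrentAdvancing(), and the implementation name mirror what is 
described above, but the bodies are my own approximation for illustration, not 
the actual HBASE-25913 code:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** A clock whose reported time always advances between calls (sketch). */
interface Clock {
  /** Returns a value strictly greater than any value previously returned by this clock. */
  long getCurrentAdvancing();
}

/**
 * Approximation of IncrementAdvancingClock: if the system clock has not
 * ticked since the last call, hand out the last value plus one, like a
 * poor man's hybrid clock. A CAS loop keeps concurrent callers from ever
 * observing time moving backwards on the same clock.
 */
class IncrementAdvancingClock implements Clock {
  private final AtomicLong lastTime = new AtomicLong();

  @Override
  public long getCurrentAdvancing() {
    while (true) {
      long last = lastTime.get();
      long next = Math.max(System.currentTimeMillis(), last + 1);
      if (lastTime.compareAndSet(last, next)) {
        return next;
      }
    }
  }
}

/** Named clocks handed out from a concurrent map, as described above (sketch). */
class Clocks {
  private static final ConcurrentHashMap<String, Clock> CLOCKS = new ConcurrentHashMap<>();

  static Clock get(String name) {
    return CLOCKS.computeIfAbsent(name, n -> new IncrementAdvancingClock());
  }
}
{code}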

These implementations were microbenchmarked to identify the superior option, 
which seems to be BoundedIncrementYieldAdvancingClock. 
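
For reference, a sketch of the bounded increment-then-yield strategy, assuming 
the same Clock interface as in the sketch above; again this is an 
approximation of the idea, not the code that was benchmarked:

{code:java}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

/**
 * Approximation of BoundedIncrementYieldAdvancingClock: run ahead of the
 * system clock by at most MAX_DRIFT_MS; once that bound is reached, yield
 * briefly until the wall clock has caught up enough to stay within the bound.
 */
class BoundedIncrementYieldAdvancingClock implements Clock {
  private static final long MAX_DRIFT_MS = 1000; // e.g. 1 second

  private final AtomicLong lastTime = new AtomicLong();

  @Override
  public long getCurrentAdvancing() {
    while (true) {
      long now = System.currentTimeMillis();
      long last = lastTime.get();
      long next = Math.max(now, last + 1);
      if (next - now > MAX_DRIFT_MS) {
        // Too far ahead of the wall clock: yield with a short sleep and retry.
        LockSupport.parkNanos(1_000_000L); // ~1 ms
        continue;
      }
      if (lastTime.compareAndSet(last, next)) {
        return next;
      }
    }
  }
}
{code}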

The main issue with HBASE-25913 is that serializing commits with the act of 
getting the time requires everything that might mutate the resource you are 
trying to protect to fall within that scope. For a single-row update the scope 
would be one row, which is highly granular, so performance is not affected 
much. For batch mutations, which are typical, the scope is multiple rows and 
therefore the region (strictly speaking, it is the union set of those rows; I 
will come back to this later), and serializing commits to a region is 
expensive. With BoundedIncrementYieldAdvancingClock that means the region can 
take a burst of commits within a single clock tick, up to the bound, and then 
there is a performance discontinuity while we wait for the system time to 
catch up (which can take up to 1 second). With YieldAdvancingClock, the region 
can take only one commit per clock tick, which is far worse. This should be 
managed at row scope, not region scope, but then what to do about batch 
mutations? I struggled for a while with various forms of hierarchical clocks, 
where one might get a clock for the region and then use that clock to get 
child clocks per row, and was not successful because they always ended up 
serializing _something_ at region scope. Then I reconsidered: what is it we 
are really tracking? We are tracking whether or not a row has already been 
committed to in the current clock tick. In other words, _we are tracking, per 
region, a set of rows per clock tick_. Why not track that explicitly? This 
idea is being explored in HBASE-25975, which does exactly that with a 
region-level data structure based on atomics, because access to it will be 
highly multithreaded. In some ways it is much simpler in retrospect, but may 
become more complex once optimizations are considered. 
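
To illustrate the shape of that idea, here is a rough sketch of a per-region 
structure tracking the set of rows committed to in the current clock tick. The 
names and details are hypothetical; the actual HBASE-25975 data structure, its 
batching behavior, and its optimizations may differ (row keys are shown as 
Strings for simplicity):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

/**
 * Hypothetical per-region tracker for "which rows have already been
 * committed to in the current clock tick". A commit that would be the
 * second one to a given row within the same tick waits for the next tick.
 * Stale-tick races and cleanup are glossed over here.
 */
class RowTickTracker {
  private static final class Tick {
    final long time;
    final Set<String> rows = ConcurrentHashMap.newKeySet();
    Tick(long time) { this.time = time; }
  }

  private final AtomicReference<Tick> current = new AtomicReference<>(new Tick(0L));

  /** Returns the timestamp to use for committing to this row. */
  long beginCommit(String row) {
    while (true) {
      long now = System.currentTimeMillis();
      Tick tick = current.get();
      if (now > tick.time) {
        // The clock has ticked over; install a fresh row set for the new tick.
        current.compareAndSet(tick, new Tick(now));
        continue;
      }
      if (tick.rows.add(row)) {
        return tick.time; // first commit to this row in this tick
      }
      // Already committed to this row in this tick: wait for the next tick.
      LockSupport.parkNanos(100_000L); // ~0.1 ms
    }
  }
}
{code}

The point is only that the per-tick row set lives at region scope while any 
waiting happens per row, so unrelated rows in a batch never serialize on each 
other.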



> Prevent temporal misordering on timescales smaller than one clock tick
> ----------------------------------------------------------------------
>
>                 Key: HBASE-24440
>                 URL: https://issues.apache.org/jira/browse/HBASE-24440
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0
>
>
> When mutations are sent to the servers without a timestamp explicitly 
> assigned by the client, the server will substitute the current wall clock 
> time. There are edge cases where it is at least theoretically possible for 
> more than one mutation to be committed to a given row within the same clock 
> tick. When this happens we have to track and preserve the ordering of these 
> mutations in some other way besides the timestamp component of the key. Let 
> me bypass most discussion here by noting that whether we do this or not, we 
> do not pass such ordering information in the cross-cluster replication 
> protocol. We also have interesting edge cases regarding key type precedence 
> when mutations arrive "simultaneously": we sort deletes ahead of puts. This, 
> especially in the presence of replication, can lead to visible anomalies for 
> clients able to interact with both source and sink. 
>
> There is a simple solution that removes the possibility that these edge 
> cases can occur: 
>
> We can detect, when we are about to commit a mutation to a row, whether we 
> have already committed a mutation to this same row in the current clock 
> tick. Occurrences of this condition will be rare. We are already tracking 
> current time; we have to know it in order to assign the timestamp. Where 
> this becomes interesting is how we might track the last commit time per row. 
> Making the detection of this case efficient for the normal code path is the 
> bulk of the challenge. One option is to keep track of the last locked time 
> for row locks. (Todo: how would we track and garbage collect this 
> efficiently and correctly? Not the ideal option.) We might also do this 
> tracking somehow via the memstore. (At least in this case the lifetime and 
> distribution of in-memory row state, including the proposed timestamps, 
> would align.) Assuming we can efficiently know if we are about to commit 
> twice to the same row within a single clock tick, we would simply 
> sleep/yield the current thread until the clock ticks over, and then proceed. 
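
A minimal sketch of that last step, assuming we already know the last commit 
timestamp for the row; the helper name is hypothetical:

{code:java}
final class ClockTickWait {
  /**
   * Yield until the wall clock has advanced past lastCommitTs, then return
   * the new time, which is safe to use for the next commit to the same row.
   */
  static long waitForNextTick(long lastCommitTs) {
    long now = System.currentTimeMillis();
    while (now <= lastCommitTs) {
      Thread.yield(); // or a short sleep
      now = System.currentTimeMillis();
    }
    return now;
  }
}
{code}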


