keith-turner commented on code in PR #4524:
URL: https://github.com/apache/accumulo/pull/4524#discussion_r1597510917


##########
core/src/main/java/org/apache/accumulo/core/fate/user/UserFateStore.java:
##########
@@ -150,6 +150,39 @@ protected void create(FateId fateId, FateKey fateKey) {
         + " and fateKey " + fateKey + " after " + maxAttempts + " attempts");
   }
 
+  @Override
+  public Optional<FateTxStore<T>> tryReserve(FateId fateId) {
+    // TODO 4131 should this throw an exception if the id doesn't exist 
(status = UNKNOWN)?
+    FateMutator.Status status = 
newMutator(fateId).putReservedTx(fateId).tryMutate();

Review Comment:
   This code needs to be refactored to use a UUID similar to the zookeeper 
code.  The UUID covers the  problem of a write making it the server, but the 
client not knowing the write made it.  Both zookeeper and accumulo can have 
this problem and both can have a similar solution.  The following examples show 
how the UUID can help address this problem.
   
    1. FATE1 exists and is not currently reserved.
    2. Thread1 attempts reserve FATE1 by writing `isreserved` only if current 
value is `notreserved`.
    3. Thread2 attempts reserve FATE1 by writing `isreserved` only if current 
value is `notreserved`.
    4. One of the writes succeeds and the other fails.  However before the 
tablet server can respond to Thread1 or Thread2, it dies.  This causes both 
thread to receive an unknown status.
    5. After getting the UNKNOWN status, Thread1 reads FATE1 and sees it marked 
reserved however it does not know if it reserved FATE1 or another thread.
    6. After getting the UNKNOWN status, Thread2 reads FATE1 and sees it marked 
reserved however it does not know if it reserved FATE1 or another thread.
   
   In the above example, FATE1 is marked reserved but no thread will be working 
on it.  If instead of setting a value of `isreserved` we instead set a per 
thread UUID, then the above situation could look like the following.
   
    1. FATE1 exists and is not currently reserved. Its reserved column has a 
value of empty string.
    2. Thread1 wants to reserve FATE1 and generates UUID1 for this purpose
    3. Thread2 wants to reserve FATE1 and generates UUID2 for this purpose
    4. Thread1 attempts reserve FATE1 by writing UUID1 as a value for the 
reserved column only if current value is empty.
    5. Thread2 attempts reserve FATE1 by writing UUID2 as a value for the 
reserved column only if current value is empty.
     6. The write for Thread2 succeeds and the write for Thread1 fails because 
the col val is not empty.  However before the tablet server can respond to 
Thread1 or Thread2, it dies.  This causes both thread to receive an unknown 
status.
     7.  After getting the UNKNOWN status, Thread1 reads FATE1 reserved column 
and sees UUID2 so it knows it did not reserve it.
     8. After getting the UNKNOWN status, Thread2 reads FATE1 reserved column 
and sees UUID2 so it knows it did reserve it.
   
   In addition to the UUID mentioned above, the persisted reservation will need 
another field and that is a process lock id. The UUID mentioned above is scoped 
very narrowly, it scoped to an individual reservation attempt and its used to 
handle the case of server applying write and then dying.  The process lock id 
would be used to detect reservation that are held by a process that is no 
longer alive.  The following is an example of how all of this works together.
   
    1. A manager process starts in Accumulo and acquires a lock in ZK, let 
calls it PL1
    2. The manager process reserves FATE1 setting the reservation column to 
UUID1:PL1.  UUID1 is scoped to this reservation thread/code block as mentioned 
above.
    3. After successfully reserving (could have seen unknown status and check 
for UUID1) the manager process dies
    4. A new manager process is started and acquires a lock in ZK, let calls it 
PL2
    5. The new manager process starts FATE, which starts a periodic task to 
look for reservations held by dead process.  The periodic task sees FATE1 is 
reserved by PL1 which no longer has lock in ZK and is therefore presumed dead.
    6. The periodic task issues a conditional mutation (or conditional update 
for ZK) to delete the reserved value for FATE1 only if the currently value is 
UUID1:PL1
   
   Taking all of this together, may want to do the following.
   
    1. Create a new type that represent the reservation value, maybe like 
        ```java
              class FateReseveration {
                     UUID reservationId;
                     ZooUtil.LockID processID; // TODO not sure if this is the 
best type for this
              }
        ```
     2. Use the above type internally in the User and Meta Fatestores for their 
reservations
     3. Add methods to FateStore like `Stream<FateReservation> 
getActiveReservations()` and `void deleteReservation(FateReservation)`
     4. Add a periodic task to Fate that calls the two methods above and looks 
active reservations held by dead processes and then deletes them.
     5. When creating a new FateStore instance, pass in the lock of the process 
where it is running.  For now this would mean when the manager creates a new 
FateStore instance it will pass in its lock. This will be used to set the 
processID on reservations.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to