Because of instance locks, after submitting a repair job we could not set the
"pending" tag until at least the first opcode of the job had finished.
Introduce a small delay in the repair job so that the subsequent TAGS_SET job
can finish immediately, resulting in faster operation from the user's point
of view.
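The idea can be sketched in isolation. The following is a minimal,
self-contained illustration with hypothetical types (`OpCode`,
`withStartupDelay` are stand-ins, not the real Ganeti API): a short delay
opcode is prepended to the repair job's opcode list, so the job idles briefly
before grabbing instance locks, leaving a window for the tagging job to run.

```haskell
-- Hypothetical, simplified opcode type; the real Ganeti OpTestDelay has
-- more fields (on-master, on-nodes, repeat count).
data OpCode = OpTestDelay { delaySeconds :: Int }
            | OpRepair String
  deriving (Show, Eq)

-- Prepend an artificial delay to a job's opcodes.  This is purely a
-- latency optimisation: the program stays correct without the delay.
withStartupDelay :: Int -> [OpCode] -> [OpCode]
withStartupDelay secs ops = OpTestDelay secs : ops

main :: IO ()
main = print (withStartupDelay 10 [OpRepair "instance1.example.com"])
```

The delay goes first in the list because a job executes its opcodes in order,
and locks are only acquired once the first "real" opcode starts.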

Signed-off-by: Dato Simó <d...@google.com>
---
 src/Ganeti/HTools/Program/Harep.hs | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/Ganeti/HTools/Program/Harep.hs 
b/src/Ganeti/HTools/Program/Harep.hs
index de09c8b..2ede5f2 100644
--- a/src/Ganeti/HTools/Program/Harep.hs
+++ b/src/Ganeti/HTools/Program/Harep.hs
@@ -373,9 +373,34 @@ doRepair client instData (rtype, opcodes) =
       else do
         putStrLn ("Executing " ++ show rtype ++ " repair on " ++ iname)
 
+        -- After submitting the job, we must write an autorepair:pending tag,
+        -- that includes the repair job IDs so that they can be checked later.
+        -- One problem we run into is that the repair job immediately grabs
+        -- locks for the affected instance, and the subsequent TAGS_SET job is
+        blocked, introducing an unnecessary delay for the end-user. One
+        -- alternative would be not to wait for the completion of the TAGS_SET
+        -- job, contrary to what commitChange normally does; but we insist on
+        -- waiting for the tag to be set so as to abort in case of failure,
+        -- because the cluster is left in an invalid state in that case.
+        --
+        -- The proper solution (in 2.9+) would be not to use tags for storing
+        -- autorepair data, or make the TAGS_SET opcode not grab an instance's
+        -- locks (if that's deemed safe). In the meantime, we introduce an
+        -- artificial delay in the repair job (via a TestDelay opcode) so that
+        -- once we have the job ID, the TAGS_SET job can complete before the
+        -- repair job actually grabs the locks. (Please note that this is not
+        -- about synchronization, but merely about speeding up the execution of
+        -- the harep tool. If this TestDelay opcode is removed, the program is
+        -- still correct.)
+        let opcodes' = OpTestDelay { opDelayDuration = 10
+                                   , opDelayOnMaster = True
+                                   , opDelayOnNodes = []
+                                   , opDelayRepeat = fromJust $ mkNonNegative 0
+                                   } : opcodes
+
         uuid <- newUUID
         time <- getClockTime
-        jids <- execJobs [map wrapOpCode opcodes] client
+        jids <- execJobs [map wrapOpCode opcodes'] client
 
         case jids of
           Bad e    -> exitErr e
-- 
1.8.0.2-x20-1
