Because of instance locks, after submitting a repair job we weren't able to set the "pending" tag until at least the first opcode of the job finished. Introduce a small delay in the repair job so as to allow the subsequent TAGS_SET job to finish immediately, resulting in faster operation from the user's point of view.
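
To illustrate the mechanism outside of Ganeti's own modules, here is a minimal, self-contained sketch; the OpCode type, its constructors, and withLeadingDelay below are hypothetical stand-ins for illustration only, not Ganeti's real API. Prepending a no-op delay opcode postpones the moment the repair job takes the instance locks, so the tag-setting job submitted right afterwards can run to completion first:

    -- Hypothetical stand-in types, not Ganeti's real API.
    data OpCode
      = OpTestDelay Double   -- sleep for N seconds (models a TestDelay opcode)
      | OpRepair String      -- placeholder for an actual repair opcode
      deriving Show

    -- Prepend the artificial delay.  Correctness does not depend on it; it
    -- only delays the point at which the repair job grabs the instance
    -- locks, giving the subsequent TAGS_SET job time to finish.
    withLeadingDelay :: Double -> [OpCode] -> [OpCode]
    withLeadingDelay secs ops = OpTestDelay secs : ops

    main :: IO ()
    main = mapM_ print (withLeadingDelay 10 [OpRepair "failover"])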
Signed-off-by: Dato Simó <d...@google.com>
---
 src/Ganeti/HTools/Program/Harep.hs | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/src/Ganeti/HTools/Program/Harep.hs b/src/Ganeti/HTools/Program/Harep.hs
index de09c8b..2ede5f2 100644
--- a/src/Ganeti/HTools/Program/Harep.hs
+++ b/src/Ganeti/HTools/Program/Harep.hs
@@ -373,9 +373,34 @@ doRepair client instData (rtype, opcodes) =
     else do
       putStrLn ("Executing " ++ show rtype ++ " repair on " ++ iname)
 
+      -- After submitting the job, we must write an autorepair:pending tag
+      -- that includes the repair job IDs so that they can be checked later.
+      -- One problem we run into is that the repair job immediately grabs
+      -- locks for the affected instance, and the subsequent TAGS_SET job is
+      -- blocked, introducing an unnecessary delay for the end-user. One
+      -- alternative would be not to wait for the completion of the TAGS_SET
+      -- job, contrary to what commitChange normally does; but we insist on
+      -- waiting for the tag to be set so as to abort in case of failure,
+      -- because the cluster is left in an invalid state in that case.
+      --
+      -- The proper solution (in 2.9+) would be not to use tags for storing
+      -- autorepair data, or make the TAGS_SET opcode not grab an instance's
+      -- locks (if that's deemed safe). In the meantime, we introduce an
+      -- artificial delay in the repair job (via a TestDelay opcode) so that
+      -- once we have the job ID, the TAGS_SET job can complete before the
+      -- repair job actually grabs the locks. (Please note that this is not
+      -- about synchronization, but merely about speeding up the execution of
+      -- the harep tool. If this TestDelay opcode is removed, the program is
+      -- still correct.)
+      let opcodes' = OpTestDelay { opDelayDuration = 10
+                                 , opDelayOnMaster = True
+                                 , opDelayOnNodes = []
+                                 , opDelayRepeat = fromJust $ mkNonNegative 0
+                                 } : opcodes
+
       uuid <- newUUID
       time <- getClockTime
 
-      jids <- execJobs [map wrapOpCode opcodes] client
+      jids <- execJobs [map wrapOpCode opcodes'] client
       case jids of
         Bad e -> exitErr e
-- 
1.8.0.2-x20-1