Hi everyone,

I would like to start a discussion about how the Bookie currently forces Auditor tasks to be rescheduled. The detailed BP content is below. If you have any questions or ideas, please reply to this email for further discussion. Thank you!
This is the master ticket for tracking BP-63: Proposal PR - #3964 <https://github.com/apache/bookkeeper/pull/3964>

Motivation

Currently, the Bookie can reschedule Auditor check tasks in several ways. The auditorBookieTask is excluded here because it provides a separate mechanism to trigger task re-execution. This BP specifically discusses AuditorCheckAllLedgersTask, AuditorPlacementPolicyCheckTask, and AuditorReplicasCheckTask:

1: The Bookie keeps three execution times in ZooKeeper: checkallledgersctime, placementpolicycheckctime, and replicascheckctime. Updating these execution times dynamically adjusts the execution frequency of the Auditor tasks, but the Auditor process must be restarted or the Auditor election reopened before the tasks actually run.
2: The ForceAuditorChecksCmd tool is built on the same underlying logic as point 1, so restarting the Auditor or performing an election is still necessary to trigger task execution.
3: The Decommission and RecoveryBookie tools focus on executing recovery logic and only check and recover a specific subset of Bookie services.

The methods above are complex and not stable enough for rescheduling the Auditor check tasks in a cluster.

Proposal

Therefore, I propose further optimizing the rescheduling of Auditor tasks:

1: The Auditor monitors the persistent znode path /ZK_LEDGERS_ROOT_PATH/underreplication/scheduleAuditor.
2: Users modify the task ctime using the ForceAuditorChecksCmd tool and forcefully create the above znode path using the force parameter.
3: The Auditor registers a callback on scheduleAuditor that reschedules the three tasks mentioned above.
4: After the Auditor finishes rescheduling the tasks, it deletes the scheduleAuditor node.
5: When the Auditor starts, it deletes any old scheduleAuditor node to avoid logical confusion.

This way, we can trigger the scheduling and execution of Auditor tasks through an online interface, without relying on a service restart or re-election.
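To make the proposed flow concrete, here is a minimal Java sketch of the five steps. It is only an illustration under stated assumptions, not the code in the PR: `InMemoryFlagStore` is a hypothetical in-memory stand-in for ZooKeeper, `Auditor` is a simplified stand-in for the real Auditor, and `/ledgers` is an assumed value for ZK_LEDGERS_ROOT_PATH. The sketch also assumes the callback fires on every creation of the node, whereas a real ZooKeeper watch would need to be persistent or re-registered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ScheduleAuditorSketch {
    // Path from the proposal; "/ledgers" is an assumed ZK_LEDGERS_ROOT_PATH.
    static final String SCHEDULE_AUDITOR_PATH = "/ledgers/underreplication/scheduleAuditor";

    /** Hypothetical in-memory stand-in for the ZooKeeper znode tree. */
    static final class InMemoryFlagStore {
        private final Set<String> nodes = new HashSet<>();
        private final Map<String, List<Runnable>> creationWatchers = new HashMap<>();

        // Register a callback that runs each time the node at 'path' is created.
        void watchCreation(String path, Runnable callback) {
            creationWatchers.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
        }

        // Create the node; notify watchers only if it did not exist before.
        void create(String path) {
            if (nodes.add(path)) {
                List<Runnable> callbacks = creationWatchers.get(path);
                if (callbacks != null) {
                    for (Runnable callback : new ArrayList<>(callbacks)) {
                        callback.run();
                    }
                }
            }
        }

        void delete(String path) {
            nodes.remove(path);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }
    }

    /** Simplified stand-in for the Auditor; it only records which tasks it rescheduled. */
    static final class Auditor {
        final List<String> rescheduled = new ArrayList<>();

        Auditor(InMemoryFlagStore store) {
            // Step 5: on startup, drop any stale flag left by a previous Auditor.
            store.delete(SCHEDULE_AUDITOR_PATH);
            // Steps 1 and 3: watch the flag and reschedule the three tasks when it appears.
            store.watchCreation(SCHEDULE_AUDITOR_PATH, () -> {
                rescheduled.add("AuditorCheckAllLedgersTask");
                rescheduled.add("AuditorPlacementPolicyCheckTask");
                rescheduled.add("AuditorReplicasCheckTask");
                // Step 4: remove the flag once rescheduling is done.
                store.delete(SCHEDULE_AUDITOR_PATH);
            });
        }
    }

    public static void main(String[] args) {
        InMemoryFlagStore store = new InMemoryFlagStore();
        Auditor auditor = new Auditor(store);
        // Step 2: what the force parameter of ForceAuditorChecksCmd would do.
        store.create(SCHEDULE_AUDITOR_PATH);
        System.out.println("Rescheduled: " + auditor.rescheduled);
        System.out.println("Flag still present: " + store.exists(SCHEDULE_AUDITOR_PATH));
    }
}
```

Deleting the flag inside the callback is what allows a later forced run to create the node again and trigger another round of rescheduling.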
Compatibility, Deprecation, and Migration Plan

There are no compatibility issues. This BP introduces a new trigger flag that does not affect the original logic and does not involve any changes to other existing public APIs. There is no deprecation or migration plan.

Best regards,
Wenbing Shen