Hi everyone,

I would like to start a discussion on how Auditor check tasks can currently
be force-rescheduled. The detailed BP content is below. If you have any
questions or ideas, please reply to this email so we can discuss further.
Thank you!

This is the master ticket for tracking BP-63:
Proposal PR - #3964 <https://github.com/apache/bookkeeper/pull/3964>

Motivation

Currently, there are several ways to reschedule the Auditor check tasks.
The auditorBookieTask is excluded here because it has a separate mechanism
for triggering task re-execution. This BP specifically discusses
AuditorCheckAllLedgersTask, AuditorPlacementPolicyCheckTask, and
AuditorReplicasCheckTask:

1: The Bookie keeps three execution times in ZooKeeper:
checkallledgersctime, placementpolicycheckctime, and replicascheckctime. By
updating these times we can adjust when the Auditor tasks run, but the
Auditor process must be restarted or the Auditor election re-run before the
tasks are actually triggered.

2: The ForceAuditorChecksCmd tool is built on the same underlying logic as
the first point, so it also requires restarting the Auditor or performing an
election to trigger task execution.

3: The Decommission and RecoveryBookie tools tend to focus on executing
recovery logic and only check and recover a specific subset of Bookie
services.

With the above methods, rescheduling the Auditor check tasks in a cluster is
complex and unreliable.

Proposal

Therefore, I propose further optimizing the rescheduling of Auditor tasks.

1: The Auditor monitors the persistent znode path
/ZK_LEDGERS_ROOT_PATH/underreplication/scheduleAuditor.
2: Users update the task ctime with the ForceAuditorChecksCmd tool and pass
the force parameter to create the above znode.
3: The Auditor registers a callback on the scheduleAuditor znode that
reschedules the aforementioned three tasks.
4: After the Auditor completes rescheduling the tasks, the scheduleAuditor
node is deleted.
5: When the Auditor starts, it deletes the old scheduleAuditor node to
avoid logical confusion.
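
The five steps above can be sketched roughly as follows. This is only an
illustration of the intended flow, not the code in the PR: FakeZk is a
hypothetical in-memory stand-in for ZooKeeper, the Auditor class and its
method names are made up for this example, and "/ledgers" is just an example
value for the ledgers root path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ScheduleAuditorSketch {

    /** Builds the flag path under the given ledgers root (root value is an assumption). */
    static String flagPath(String ledgersRoot) {
        return ledgersRoot + "/underreplication/scheduleAuditor";
    }

    /** Hypothetical in-memory stand-in for the few znode operations the flow needs. */
    static class FakeZk {
        private final Set<String> nodes = new HashSet<>();
        private final Map<String, List<Runnable>> watchers = new HashMap<>();

        // Simplification: unlike a standard ZooKeeper watch, which fires once,
        // this callback stays registered after it has been invoked.
        void watchCreate(String path, Runnable callback) {
            watchers.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }

        // Creating a node that did not exist notifies the watchers of that path.
        void create(String path) {
            if (nodes.add(path)) {
                List<Runnable> callbacks =
                        watchers.getOrDefault(path, Collections.<Runnable>emptyList());
                for (Runnable callback : callbacks) {
                    callback.run();
                }
            }
        }

        void delete(String path) {
            nodes.remove(path);
        }
    }

    /** Illustrative Auditor that only records which tasks it would reschedule. */
    static class Auditor {
        private final FakeZk zk;
        private final String flag;
        final List<String> rescheduled = new ArrayList<>();

        Auditor(FakeZk zk, String ledgersRoot) {
            this.zk = zk;
            this.flag = flagPath(ledgersRoot);
        }

        void start() {
            zk.delete(flag);                          // step 5: drop any stale flag
            zk.watchCreate(flag, this::onFlagCreated); // step 1: monitor the flag znode
        }

        private void onFlagCreated() {
            // step 3: reschedule the three check tasks
            rescheduled.add("AuditorCheckAllLedgersTask");
            rescheduled.add("AuditorPlacementPolicyCheckTask");
            rescheduled.add("AuditorReplicasCheckTask");
            zk.delete(flag);                          // step 4: remove the flag when done
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String flag = flagPath("/ledgers");

        zk.create(flag);                 // stale flag left over from an earlier run
        Auditor auditor = new Auditor(zk, "/ledgers");
        auditor.start();                 // stale flag is removed, nothing is rescheduled

        zk.create(flag);                 // step 2: what the tool would do with force set
        System.out.println(auditor.rescheduled);
        System.out.println(zk.exists(flag));
    }
}
```

Deleting the flag on startup before registering the callback is what keeps a
flag left behind by a previous Auditor from triggering an unexpected
reschedule.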

This way, we can trigger the scheduling and execution of Auditor tasks
through an online interface without relying on service restart or
re-election.

Compatibility, Deprecation, and Migration Plan

There are no compatibility issues. This BP introduces a new trigger flag
that does not affect the original logic and does not involve any changes to
other existing public APIs. There is no deprecation or migration plan.


Best regards,

Wenbing Shen
