[
https://issues.apache.org/jira/browse/BOOKKEEPER-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438856#comment-13438856
]
Rakesh R commented on BOOKKEEPER-272:
-------------------------------------
Thanks again, Ivan, for the detailed review. Could you please give a little more
information on the following:
bq.In Auditor, don't call getChildren from process(). Instead call it just after
the take() in main loop. In general, you shouldn't call any blocking methods
from the zk event handler thread.
Yes, thanks for raising this. Agreed: a blocking call there would also delay
other watch notifications.
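
To make sure I have understood the suggestion, here is a minimal sketch of the pattern. It is not the patch code: the class AuditorLoopSketch, the method handleNextEvent(), the Supplier standing in for the blocking getChildren() call, and the path string are all hypothetical, and no ZooKeeper classes are used so that it runs on its own. The handler only enqueues the event; the blocking read happens in the main loop, right after take().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Supplier;

// Hypothetical stand-in for the Auditor; not the actual BookKeeper class.
final class AuditorLoopSketch {
    // Notifications handed over from the event-handler thread.
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    // Stand-in (assumption) for a blocking getChildren() call against ZooKeeper.
    private final Supplier<List<String>> getChildren;

    AuditorLoopSketch(Supplier<List<String>> getChildren) {
        this.getChildren = getChildren;
    }

    // Plays the role of process(): it only enqueues the event and returns,
    // so the event-handler thread is never blocked.
    void process(String eventPath) {
        events.add(eventPath);
    }

    // One iteration of the main loop: take() an event, then do the blocking read.
    List<String> handleNextEvent() {
        try {
            events.take(); // blocks the auditor thread, not the event thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Collections.emptyList();
        }
        return getChildren.get();
    }

    public static void main(String[] args) {
        AuditorLoopSketch auditor =
                new AuditorLoopSketch(() -> Arrays.asList("BK2", "BK3"));
        auditor.process("/illustrative/bookies/path"); // illustrative path only
        System.out.println(auditor.handleNextEvent());
    }
}
```

In the real Auditor the main loop would run on its own thread and repeat this step for every queued notification.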
bq.Logic change in ZkLedgerUnderreplicationManager is wrong [previous was wrong
also]. The break should be a return.
If I understand correctly, on NodeExistsException we are trying to append the
missingReplica to the existing underreplicated ledger entry, so that it also
records the second failed bookie that held a copy of the ledger.
If we simply return, the setData() method will not be called and there is a
chance of losing the information about the second replica.
For example, take ledger L000001 with ensemble BK1, BK2, BK3.
Say BK1 fails first; the ledger is then marked underreplicated as L000001 (with
BK1 as the data).
Now BK2 also fails; the create then throws NodeExistsException (NEE), so we
append BK2 as well: L000001(BK1 BK2).
I think the "break;" statement makes sense here, and the duplicate entry
addition that follows it should be removed, as in my latest patch.
Am I missing anything?
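
The behaviour I am describing can be sketched as follows. This is an in-memory stand-in under my own assumptions, not the ZkLedgerUnderreplicationManager code: the class UnderreplicationSketch and the map that plays the role of the znodes are hypothetical. The point is only that the "node already exists" path falls through to the update instead of returning, and that the same bookie is never recorded twice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the underreplicated-ledger znodes.
final class UnderreplicationSketch {
    // ledger name -> bookies recorded as missing replicas (the znode data)
    private final Map<String, Set<String>> nodes = new HashMap<>();

    synchronized void markLedgerUnderreplicated(String ledger, String missingReplica) {
        Set<String> replicas = nodes.get(ledger);
        if (replicas == null) {
            // Equivalent of a successful create(): first failure for this ledger.
            replicas = new LinkedHashSet<>();
            nodes.put(ledger, replicas);
        }
        // Equivalent of the NodeExistsException path: we do not return early,
        // so the additional failed bookie is still written (the setData() step).
        // The set ignores a bookie that is already recorded, so no duplicates.
        replicas.add(missingReplica);
    }

    synchronized List<String> missingReplicas(String ledger) {
        Set<String> replicas = nodes.get(ledger);
        return replicas == null ? new ArrayList<String>() : new ArrayList<String>(replicas);
    }

    public static void main(String[] args) {
        UnderreplicationSketch manager = new UnderreplicationSketch();
        manager.markLedgerUnderreplicated("L000001", "BK1"); // BK1 fails first
        manager.markLedgerUnderreplicated("L000001", "BK2"); // BK2 fails later
        System.out.println(manager.missingReplicas("L000001")); // prints [BK1, BK2]
    }
}
```

With a plain return on the exists path, the second call would leave the entry as L000001(BK1) only.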
bq.There should be a main method in AutoRecoveryManager.
Yes, I'll add a main method as well. But what about keeping the start() and
stop() methods public? In future this would allow others (any external entity)
to manage the recovery process easily.
bq.I think AutoRecoveryManager should be responsible for running recoveryworker
as well.
Of course, I'll integrate the RecoveryWorker (RW) so that it is initialized as
part of AutoRecoveryManager (ARM).
bq.auditorElector should never be null, unless initialization fails. If
initialization fails, start and stop should never be run. Perhaps we should use
guava service [1] here as I also suggested to Uma for BOOKKEEPER-248.
I added the null check only because the start() and stop() methods are public.
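
As one possible alternative to the null check, a small state guard would make start() and stop() safe for external callers. This is only a hand-rolled sketch under my own assumptions, not the Guava service Ivan suggested and not the patch code; the class RecoveryLifecycleSketch and its State enum are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lifecycle guard; not the actual AutoRecoveryManager.
final class RecoveryLifecycleSketch {
    enum State { NEW, RUNNING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    // Returns true only for the first call on a fresh instance.
    boolean start() {
        return state.compareAndSet(State.NEW, State.RUNNING);
    }

    // Returns true only if the instance was actually running.
    boolean stop() {
        return state.compareAndSet(State.RUNNING, State.STOPPED);
    }

    State state() {
        return state.get();
    }

    public static void main(String[] args) {
        RecoveryLifecycleSketch arm = new RecoveryLifecycleSketch();
        System.out.println(arm.stop());  // false: never started
        System.out.println(arm.start()); // true
        System.out.println(arm.stop());  // true
    }
}
```

The real start() would initialize the elector and worker only when the transition succeeds, so a stop() without a prior successful start() becomes a no-op.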
I'll rework the other points.
> Provide automatic mechanism to know bookie failures
> ---------------------------------------------------
>
> Key: BOOKKEEPER-272
> URL: https://issues.apache.org/jira/browse/BOOKKEEPER-272
> Project: Bookkeeper
> Issue Type: Sub-task
> Components: bookkeeper-server
> Reporter: Rakesh R
> Assignee: Rakesh R
> Fix For: 4.2.0
>
> Attachments: BOOKKEEPER-272.1.patch, BOOKKEEPER-272.2.patch,
> BOOKKEEPER-272.3.patch, BOOKKEEPER-272.Auditor.1.patch,
> BOOKKEEPER-272.Auditor.patch
>
>
> The idea is to build an automatic mechanism to detect bookie failures.
> Set up bookie failure notifications to start the re-replication process.
> There are multiple approaches to finding bookie failures. Please refer to the
> documents attached in BookKeeper-237.