[
https://issues.apache.org/jira/browse/RATIS-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913380#comment-16913380
]
Rajeshbabu Chintaguntla commented on RATIS-556:
-----------------------------------------------
bq.what is and who will pass this origin peer? Why can't we create an inverted
index on below the existing map to get logs hosted on a peer?
The user needs to pass this [[email protected]]. If we just create an inverted
index, then since a log is served by 3 peers, we would need to close the log
when any one of those peers goes down. In the HBase use case, however, a log is
created by a single server, and that server can be detected from the log name,
but such name-based detection cannot be done in this generic log service.
Since the log is replicated, it won't be recovered unless the main server that
created it goes down. We can handle this use case at least by passing the peer,
so that the log is closed only when that peer goes down.
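To illustrate the origin-peer idea, here is a minimal sketch (not taken from
the patch). The class name OriginPeerRegistry and its methods are hypothetical
and are not part of the Ratis LogService API; the sketch only assumes that the
user passes a peer id when the log is created.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry mapping each log name to the peer that created it. */
final class OriginPeerRegistry {
    // log name -> id of the origin peer passed by the user at creation time
    private final Map<String, String> originByLog = new ConcurrentHashMap<>();

    /** Records the origin peer for a newly created log. */
    void registerLog(String logName, String originPeerId) {
        originByLog.put(logName, originPeerId);
    }

    /**
     * Returns the logs whose origin peer is the failed peer, sorted by name.
     * Logs that the failed peer merely replicated are not returned.
     */
    List<String> logsToClose(String failedPeerId) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : originByLog.entrySet()) {
            if (e.getValue().equals(failedPeerId)) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

With a plain inverted index, the failure of any of the 3 serving peers would
select the log; here only the failure of the origin peer does.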
bq.And , Is notifySlowness() the right API to declare node as dead? can't we
use the heartbeat mechanism like every peer will be sending heartbeat request
regularly to meta quorum using a separate thread? (or it will create a storm of
heart beat request and congestion at meta quorum?)
notifySlowness at least signals a failure case. I tried to detect the failed
node from the commit info of the group, but that's not working. So, as you
suggested, I introduced a heartbeat mechanism: when a peer doesn't send a
heartbeat for a particular period, the logs served by that peer are closed.
Uploaded v1 patch handling the same.
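For readers who have not opened the patch, the heartbeat check can be pictured
roughly as below. This is only a sketch under my own assumptions, not the code
in the v1 patch: HeartbeatMonitor, onHeartbeat and expirePeers are hypothetical
names, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker kept on the meta quorum side. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    // peer id -> timestamp (ms) of the last heartbeat received from that peer
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(Duration timeout) {
        this.timeoutMillis = timeout.toMillis();
    }

    /** Called whenever a heartbeat from the given peer arrives. */
    void onHeartbeat(String peerId, long nowMillis) {
        lastSeen.put(peerId, nowMillis);
    }

    /**
     * Returns the peers whose last heartbeat is older than the timeout and
     * stops tracking them. The caller would then close the logs they serve.
     */
    List<String> expirePeers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        for (String peerId : failed) {
            lastSeen.remove(peerId);
        }
        return failed;
    }
}
```

A periodic task on the meta quorum could call expirePeers with the current
time and close the logs of every peer it returns.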
A few points still need to be handled:
1) Need to add configurations for the heartbeat send interval and for the
period after which a node is considered failed.
2) Formatting in some places.
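For point 1, the two settings could look something like the sketch below. The
property keys and default values are placeholders I made up for illustration;
they are not existing Ratis configuration keys.

```java
import java.time.Duration;
import java.util.Properties;

/** Hypothetical holder for the two heartbeat-related settings. */
final class HeartbeatConfig {
    // Placeholder keys, not real Ratis configuration names.
    static final String INTERVAL_KEY = "logservice.heartbeat.interval.ms";
    static final String TIMEOUT_KEY = "logservice.heartbeat.timeout.ms";

    final Duration interval; // how often a peer sends a heartbeat
    final Duration timeout;  // silence after which a peer is considered failed

    HeartbeatConfig(Properties props) {
        // Default values are arbitrary and chosen only for this example.
        this.interval = Duration.ofMillis(
            Long.parseLong(props.getProperty(INTERVAL_KEY, "3000")));
        this.timeout = Duration.ofMillis(
            Long.parseLong(props.getProperty(TIMEOUT_KEY, "30000")));
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("heartbeat interval must be positive");
        }
        // A timeout no longer than the interval would flag healthy peers.
        if (timeout.compareTo(interval) <= 0) {
            throw new IllegalArgumentException("timeout must exceed the interval");
        }
    }
}
```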
[~elserj] [[email protected]] Please review v1 patch.
> Detect node failures and close the log to prevent additional writes
> -------------------------------------------------------------------
>
> Key: RATIS-556
> URL: https://issues.apache.org/jira/browse/RATIS-556
> Project: Ratis
> Issue Type: Improvement
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Rajeshbabu Chintaguntla
> Priority: Major
> Attachments: RATIS-556-wip.patch, RATIS-556_v1.patch
>
>
> Currently there is no way to detect the node failures at master log servers
> and add new nodes to the group serving the log. We need to analyze how Ozone
> is working in this case.