[jira] [Commented] (RATIS-556) Detect node failures and close the log to prevent additional writes

Rajeshbabu Chintaguntla (Jira) Thu, 22 Aug 2019 07:41:48 -0700


    [ 
https://issues.apache.org/jira/browse/RATIS-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913380#comment-16913380
 ]


Rajeshbabu Chintaguntla commented on RATIS-556:
-----------------------------------------------

bq. What is this origin peer, and who will pass it? Why can't we create an 
inverted index on the existing map below to get the logs hosted on a peer?
User needs to pass this [[email protected]]. If we just create an inverted 
index, then since a log is served by 3 peers, we would need to close the log 
when any one of those peers goes down. But in the HBase use case a log is 
created by one server, and we can detect that server from the log name; such 
functionality cannot be built into this generic log service. Since the log is 
replicated, it will not be recovered unless the main server that created it 
goes down. We can handle such a use case at least by passing the peer, and 
closing the log only when that peer goes down.
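
To illustrate the difference from an inverted index over all serving peers, 
here is a minimal sketch of tracking only the origin peer per log. The class 
and method names (OriginPeerIndex, register, logsToClose) are hypothetical 
and are not taken from the patch or from the Ratis LogService API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: not the patch's code and not a Ratis API.
class OriginPeerIndex {
  // log name -> id of the peer the caller declared as the log's origin
  private final Map<String, String> originByLog = new ConcurrentHashMap<>();

  // Called at log creation time with the origin peer passed by the user.
  void register(String logName, String originPeerId) {
    originByLog.put(logName, originPeerId);
  }

  // Returns the logs whose origin is the failed peer, sorted by name.
  // Logs that the failed peer merely replicates are not returned.
  List<String> logsToClose(String failedPeerId) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, String> e : originByLog.entrySet()) {
      if (e.getValue().equals(failedPeerId)) {
        result.add(e.getKey());
      }
    }
    Collections.sort(result);
    return result;
  }
}
```

With this shape, a failure of a peer that is only a replica of a log leaves 
that log open, which is the behaviour the HBase use case needs.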

bq. And is notifySlowness() the right API to declare a node as dead? Can't we 
use a heartbeat mechanism, where every peer regularly sends a heartbeat request 
to the meta quorum using a separate thread? (Or will it create a storm of 
heartbeat requests and congestion at the meta quorum?)
notifySlowness at least reports the failure case. I tried to detect the failed 
node from the commit info of the group, but that is not working. So, as you 
suggested, I introduced a heartbeat mechanism: when a peer doesn't send a 
heartbeat for a particular period, the logs served by that peer are closed.
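
As a rough sketch of the timeout check described above (not the code in the 
v1 patch), the meta quorum side could record the last heartbeat time per peer 
and periodically collect the peers whose heartbeat is overdue. The names 
HeartbeatTracker, onHeartbeat and expirePeers are assumptions made for this 
example, and timestamps are passed in explicitly so the logic can be checked 
without a real clock.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: not the patch's code and not a Ratis API.
class HeartbeatTracker {
  private final long timeoutMs;
  // peer id -> time (ms) at which the last heartbeat was received
  private final Map<String, Long> lastSeenMs = new ConcurrentHashMap<>();

  HeartbeatTracker(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Record a heartbeat received from a peer at the given time.
  void onHeartbeat(String peerId, long nowMs) {
    lastSeenMs.put(peerId, nowMs);
  }

  // Return (sorted) the peers silent for longer than the timeout and stop
  // tracking them, so each failure is reported once.
  List<String> expirePeers(long nowMs) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, Long> e : lastSeenMs.entrySet()) {
      if (nowMs - e.getValue() > timeoutMs) {
        failed.add(e.getKey());
      }
    }
    for (String peerId : failed) {
      lastSeenMs.remove(peerId);
    }
    Collections.sort(failed);
    return failed;
  }
}
```

A scheduled task would call expirePeers with the current time and close the 
logs served by each returned peer; a peer that resumes sending heartbeats is 
tracked again from its next heartbeat.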

Uploaded the v1 patch handling the same.

A few points still need to be handled:
1) Need to add configurations for the interval at which the heartbeat is sent 
and for the period after which a node is considered failed.
2) Formatting in some places.

[~elserj] [[email protected]] Please review v1 patch.

> Detect node failures and close the log to prevent additional writes
> -------------------------------------------------------------------
>
>                 Key: RATIS-556
>                 URL: https://issues.apache.org/jira/browse/RATIS-556
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Rajeshbabu Chintaguntla
>            Priority: Major
>         Attachments: RATIS-556-wip.patch, RATIS-556_v1.patch
>
>
> Currently there is no way to detect the node failures at master log servers 
> and add new nodes to the group serving the log. We need to analyze how Ozone 
> is working in this case.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
