Looks good to me. Thanks for the patch.

Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>

On 09/15/2014 10:15 PM, Junxiao Bi wrote:
> Firing quorum before a connection is established can cause an unexpected
> node to reboot.
> Assume there are 3 nodes in the cluster: Node 1, 2 and 3. Nodes 2 and 3 have
> the wrong IP address for Node 1 in cluster.conf, and global heartbeat is
> enabled in the cluster. After heartbeat is started on these three nodes,
> Node 1 will reboot due to quorum fencing. The same happens if Node 1's
> networking is not ready when global heartbeat starts.
> The reboot is unfriendly because the customer is not yet fully ready for
> ocfs2 to work.
> Fix it by not allowing quorum to fire before a connection is established.
> In this case, ocfs2 will wait until the wrong configuration is fixed or
> networking comes up before continuing.
> Also update the log message to tell the user where to look when a
> connection has not been established for a long time.
>
> Signed-off-by: Junxiao Bi <junxiao...@oracle.com>
> Reviewed-by: Srinivas Eeda <srinivas.e...@oracle.com>
> ---
>   fs/ocfs2/cluster/tcp.c |    5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
> index ea34952..b2cc010 100644
> --- a/fs/ocfs2/cluster/tcp.c
> +++ b/fs/ocfs2/cluster/tcp.c
> @@ -536,7 +536,7 @@ static void o2net_set_nn_state(struct o2net_node *nn,
>       if (nn->nn_persistent_error || nn->nn_sc_valid)
>               wake_up(&nn->nn_sc_wq);
>   
> -     if (!was_err && nn->nn_persistent_error) {
> +     if (was_valid && !was_err && nn->nn_persistent_error) {
>               o2quo_conn_err(o2net_num_from_nn(nn));
>               queue_delayed_work(o2net_wq, &nn->nn_still_up,
>                                  msecs_to_jiffies(O2NET_QUORUM_DELAY_MS));
> @@ -1721,7 +1721,8 @@ static void o2net_connect_expired(struct work_struct *work)
>       spin_lock(&nn->nn_lock);
>       if (!nn->nn_sc_valid) {
>               printk(KERN_NOTICE "o2net: No connection established with "
> -                    "node %u after %u.%u seconds, giving up.\n",
> +                    "node %u after %u.%u seconds, check network and"
> +                    " cluster configuration.\n",
>                    o2net_num_from_nn(nn),
>                    o2net_idle_timeout() / 1000,
>                    o2net_idle_timeout() % 1000);


_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
