> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Pedro Figueira
> Sent: 02/03/2009 09:07
> To: [email protected]
> Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
> Subject: [Ocfs2-users] strange node reboot in RAC environment
>
> Hi all
>
> We have a 4-node Oracle RAC cluster with the following software versions:
>
> Oracle and Clusterware version 10.2.0.4
> Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-55.ELlargesmp
> ocfs2-tools-1.2.4-1
> ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
> ocfs2console-1.2.4-1
>
> Timeout parameters:
> Heartbeat dead threshold: 31
> Network idle timeout: 10000
> Network keepalive delay: 5000
> Network reconnect delay: 2000
>
> Until late last year the cluster was rock solid (hundreds). From January
> onward all the servers started to reboot simultaneously, but the strange
> thing is that there are no log messages in /var/log/messages, so we don't
> know whether this is an ocfs2-related problem. The reboots seem to be
> related to the backup process (maybe extra load?). Other reboots only
> affect 2 out of 4 nodes.

Since ocfs2 prints its messages to the console and those might not get
captured by anything, I recommend setting up the iLO virtual serial port
and using something like conserver to attach a console to it. I do this
for all our OCFS2 hosts and keep a log of everything that happens on them,
including the BIOS screens. If ocfs2 is fencing because of I/O issues, it
will show up there.
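For example, a minimal conserver.cf entry might look like the sketch below.
The host name, iLO address, login and log path are placeholders for your
environment, and the exact directives can differ between conserver versions,
so treat it as a starting point rather than a finished config:

    # log every console to its own file, named after the console
    default * {
        master localhost;
        logfile /var/log/consoles/&;
    }

    # one entry per cluster node; "grid2db1-ilo" and "Administrator" are
    # placeholders for that node's iLO address and login
    console grid2db1 {
        type exec;
        exec ssh Administrator@grid2db1-ilo;
    }

Once connected to the iLO you start the virtual serial port by hand (on
iLO2 that is the "vsp" command). For kernel messages to actually reach
that console, the nodes also need console=ttyS1,115200 (or whichever ttyS
the BIOS maps the virtual serial port to) on the kernel command line;
otherwise you will only capture BIOS and boot loader output.

By the way, a heartbeat dead threshold of 31 corresponds to a disk
heartbeat timeout of (31 - 1) * 2000 ms = 60000 ms, which is exactly the
"60000 milliseconds" in node 4's o2hb_write_timeout error below, so that
node really did go a full minute without completing a heartbeat write to
sdl before it fenced.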
> Last night we updated the HP firmware and drivers on the DL580 G4 server
> and today we had another reboot (this time with the following messages
> in /var/log/messages):
>
> NODE 1:
> ------------------------------------------------------
> Feb 3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
> (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it
> down.
> Feb 3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1233670362.97595
> now 1233670372.96280 dr 1233670362.97580 adv
> 1233670362.97604:1233670362.97604 func (c77ed98a:504)
> 1233670067.138220:1233670067.138233)
> Feb 3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
> grid2db4 (num 3) at 10.0.2.52:7777
> Feb 3 14:16:26 grid2db1 syslogd 1.4.1: restart.
> Feb 3 14:16:26 grid2db1 syslog: syslogd startup succeeded
>
> NODE 4:
> ------------------------------------------------------
> Feb 3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
> Heartbeat write timeout to device sdl after 60000 milliseconds
> Feb 3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
> blocking operations (cur = 18):
> Feb 3 14:16:27 grid2db4 syslogd 1.4.1: restart.
> Feb 3 14:16:27 grid2db4 syslog: syslogd startup succeeded
>
> Other reboots simply don't log any error message.
>
> So my question is whether these reboots could be triggered by OCFS2, and
> how can we debug this problem? Should I change the timeout parameters?
>
> We are also planning to upgrade to OCFS2 1.2.9-1, OCFS2 Tools 1.2.7-1 and
> the latest distro kernel; any catch?
>
> Best regards and thanks for any answer.
>
> Pedro Figueira
> Serviço de Estrangeiros e Fronteiras
> Direcção Central de Informática
> Departamento de Produção
> Telefone: + 351 217 115 153
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Sunil Mushran
> Sent: Saturday, 31 January 2009 15:59
> To: Carl Benson
> Cc: [email protected]
> Subject: Re: [Ocfs2-users] one node rejects connection from new node
>
> Nodes can be added to an online cluster. The instructions are listed
> in the user's guide.
>
> On Jan 31, 2009, at 7:53 AM, Carl Benson <[email protected]> wrote:
>
> > Sunil,
> >
> > Thank you for responding. I will try o2cb_ctl on Monday, when I have
> > physical access to hit Reset in case one or more nodes lock up.
> >
> > If there really is a requirement to restart the cluster on wilson1
> > every time I add a new node (and I have five or six more nodes to add),
> > that is too bad. Wilson1 is a 24x7 production system.
> >
> > --Carl Benson
> >
> > Sunil Mushran wrote:
> >> Could be that the cluster was already online on wilson1 when you
> >> propagated the cluster.conf to all nodes. If so, restart the cluster
> >> on that node.
> >>
> >> To add a node to an online cluster, you need to use the o2cb_ctl
> >> command. Details are in the 1.4 user's guide.
> >>
> >> Carl J. Benson wrote:
> >>
> >>> Hello.
> >>>
> >>> I have three systems that share an ocfs2 filesystem, and I'm
> >>> trying to add a fourth system.
> >>>
> >>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
> >>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
> >>>
> >>> cluster.conf looks like this:
> >>>
> >>> node:
> >>>         ip_port = 7777
> >>>         ip_address = 140.107.170.116
> >>>         number = 0
> >>>         name = merlot1
> >>>         cluster = ocfs2
> >>>
> >>> node:
> >>>         ip_port = 7777
> >>>         ip_address = 140.107.158.54
> >>>         number = 1
> >>>         name = merlot2
> >>>         cluster = ocfs2
> >>>
> >>> node:
> >>>         ip_port = 7777
> >>>         ip_address = 140.107.158.82
> >>>         number = 2
> >>>         name = wilson1
> >>>         cluster = ocfs2
> >>>
> >>> node:
> >>>         ip_port = 7778
> >>>         ip_address = 140.107.170.108
> >>>         number = 3
> >>>         name = gladstone
> >>>         cluster = ocfs2
> >>>
> >>> cluster:
> >>>         node_count = 4
> >>>         name = ocfs2
> >>>
> >>> gladstone is the new node.
> >>>
> >>> I edited the cluster.conf on wilson1 using ocfs2console, and
> >>> propagated it to the other systems from there.
> >>>
> >>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online
> >>> ocfs2, merlot1 accepts the connection from gladstone, as does
> >>> merlot2. However, wilson1 rejects it as an unknown node! For example:
> >>>
> >>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795 attempt
> >>> to connect from unknown node at 140.107.170.108:37795
> >>>
> >>> Why would this happen?

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
