Re: Hmm, here is an example. Re: [Ocfs2-users] Also just a comment to the Oracle guys

Luis Freitas Sun, 11 Feb 2007 08:55:49 -0800

Alexei,
 
      I think you have a point too. Maybe OCFS2 could behave like NetApp and
simply hang when there is a problem, leaving the fencing to CRS or whatever
other clusterware is in use.
 
      Does anyone from Oracle have an opinion on this?
 
 Regards,
 Luis


Alexei_Roudnev <[EMAIL PROTECTED]> wrote:
 Absolutely. I know how redo and RAC interact; you are absolutely correct.
  
 Sometimes CSSD reboots one node and that's all: good luck. Sometimes OCFS
reboots one node and CSSD reboots another node: bad luck. That's why it is
important not to mix different cluster managers on the same servers, or at
least to let them interact and reach the same decision about _who is master
today_ (that is, who will survive a split-brain situation).
  
 RAC is a somewhat simpler case because Oracle is usually the primary service,
so if it decides to reboot, that is a reasonable decision. OCFSv2 is another
story: sometimes it is a _secondary_ service (for example, it is used for
backups only), and if it is secondary then it should rather stop working than
reboot.
  
 It reveals 2 big problems (both Oracle and OCFSv2 are affected):
 - A single-interface heartbeat is not reliable. You CANNOT build a reliable
   cluster using a single heartbeat channel. Classical clusters (Veritas VCS)
   use 2 - 4 different heartbeat media (we use 2 independent Ethernet hubs and
   2 generic Ethernet LANs in Veritas, 2 Ethernets + 1 serial link in Linux
   clusters, 2 Ethernets + 1 serial link in a Cisco PIX cluster, and so on).
   Neither OCFS nor Oracle RAC can use more than one (to be precise, you can
   configure several interfaces for the RAC interconnect in the SPFILE, but
   that will not affect CSSD). In addition, the OCFS defaults are very strange
   and unrealistic: Ethernet and FC cannot guarantee heartbeat times better
   than 1 minute on average. I mean that in case of any network
   reconfiguration the heartbeat will experience a 30 - 50 second delay, so if
   you configure a 12 second timeout (the default in OCFSv2) you are naive, to
   say the least.
  
 - _Self-fencing_ is too easy. Again, if an OCFS node loses its connection to
   the disk, it should not self-fence: it can send data to other nodes (or
   request it from other nodes), it can unmount the file system and try to
   remount it, it can release control and resume operations. Immediate fencing
   is necessary in SOME cases but not in all. If the FS has no pending
   operations, then fencing by reboot makes little difference compared with
   just a _remount_. It's not as simple as I explain here, but the truth is
   that fencing decisions are not flexible enough and decrease reliability
   dramatically (I posted a list of scenarios in which fencing should not
   happen).
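
 To make the first point concrete, here is a minimal Python sketch of a
 multi-channel heartbeat check. It is not how OCFS2, CSSD or VCS is
 implemented; the class name, the channel names and the 40 second timeout in
 the usage note are assumptions chosen for illustration. The peer is declared
 dead only when every channel has been silent for longer than the timeout, so
 a glitch on one link alone does not trigger fencing.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat seen from one peer on several channels.

    Hypothetical illustration only; not taken from any real cluster manager.
    """

    def __init__(self, channels, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        now = clock()
        # Treat every channel as freshly alive at start-up.
        self.last_seen = {name: now for name in channels}

    def beat(self, channel):
        """Record a heartbeat received on one channel."""
        self.last_seen[channel] = self.clock()

    def silent_channels(self):
        """Channels with no heartbeat for longer than the timeout."""
        now = self.clock()
        return [c for c, t in self.last_seen.items()
                if now - t > self.timeout_s]

    def peer_is_dead(self):
        """The peer counts as dead only if ALL channels are silent."""
        return len(self.silent_channels()) == len(self.last_seen)
```

 For example, HeartbeatMonitor(["eth0", "eth1", "serial"], timeout_s=40)
 keeps the peer alive as long as beat() is still being called for at least
 one of the three channels.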
  
 In addition, I noticed other problems with OCFSv2 too (such as excessive CPU
usage in some cases).
  
 I use OCFSv2, even in production. But I do it with a grain of salt, have a
backup plan for _how to run without it_, and don't use it for heavily loaded
file systems with millions of files (I use heartbeat, reiserfs and APC switch
fencing, with 3 independent heartbeats and a 40 second timeout). So far I
have had one glitch with OCFSv2 (when it remounted read-only on one node) and
that's all; no other problems in production (OCFSv2 is used during
starts/stops only, so it is safe). But I run stress tests in the lab, I am
running it in the lab clusters now (including RAC), and the conclusion is
simple: as a cluster, it is not reliable; as a file system, it may have hidden
bugs, so be extra careful with it.
  
 PS. Good point - it improves every month. Some problems are in  the past 
already. 
  
 PPS. All these lab reboots have been caused by extremely heavy load or by
hardware failures (simulated or real). It works better in real life. But my
experience tells me that if I can break something in the lab in 3 days, it's
a matter of a few months before it breaks in production.
  
    ----- Original Message ----- 
   From: Luis Freitas
   To: [email protected]
   Sent: Saturday, February 10, 2007 4:52 PM
   Subject: Re: Hmm, here is an example. Re: [Ocfs2-users] Also just a
comment to the Oracle guys
   

   Alexei,
    
       Actually your log seems to show that CSSD (Oracle    CRS) rebooted the 
node before OCFS2 got a chance to do it.
    
       On a RAC cluster, if the interconnect is interrupted, all the nodes
hang until the split-brain resolution is complete and the recovery of all the
crashed nodes is completed. This is needed because every read of an Oracle
data block needs a ping to the other nodes.
    
       The view of the data must be consistent: when one node reads a
particular data block, the Oracle database first pings the other nodes to
ensure that they have not modified the block without yet flushing it to disk.
Another node may even reply with the block itself, avoiding the disk access
(Cache Fusion).
    
       When a split brain occurs, the blocks not yet flushed to disk are
lost, and they are rebuilt using the redo threads of the particular nodes
that crashed. During this interval all the database instances "freeze", since
until the node recovery is complete there is no way to guarantee that a block
read from disk has not been altered on the crashed node.
    
       So the fencing is needed even if there is no disk activity, as the
entire cluster hangs the moment the interconnect goes down. And the timeout
for the fencing must be as small as possible to prevent a long cluster
reconfiguration delay. Of course the timeout must be tuned so that it is
larger than Ethernet switch failovers, or storage controller or disk
multipath failovers. Or, if possible, the failover times should be reduced.
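
       The tuning rule in the paragraph above can be written down as a small
Python sketch. The function names, the component names and the 5 second
safety margin are assumptions made for illustration, and the failover times
in the usage note are placeholders, not measurements.

```python
def min_fencing_timeout(failover_times_s, safety_margin_s=5.0):
    """Smallest fencing timeout that still outlasts the slowest failover.

    failover_times_s maps a component name to its worst-case failover time
    in seconds. All names and numbers are illustrative assumptions.
    """
    if not failover_times_s:
        raise ValueError("need at least one failover time")
    return max(failover_times_s.values()) + safety_margin_s


def timeout_is_safe(timeout_s, failover_times_s, safety_margin_s=5.0):
    """True if the configured timeout survives every listed failover."""
    return timeout_s >= min_fencing_timeout(failover_times_s, safety_margin_s)
```

       With placeholder values such as {"switch": 50, "multipath": 30}, a
12 second timeout is rejected and anything from 55 seconds up is accepted.
Reducing the failover times lowers the bound, which is the other option
mentioned above.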
    
      Now, on the other hand, I too am having problems with OCFS2. It seems
much less robust than ASM and the previous version, OCFS, especially under
heavy disk activity. But I do expect these problems to get solved in the
near future, as the 2.4 kernel VM problems were.
    
   Regards,
   Luis
   
Alexei_Roudnev <[EMAIL PROTECTED]> wrote:
     Additional info: the node did not have ANY active OCFSv2 operations
     (OCFSv2 is used for backups only, and from another node only). So, if
     the system just SUSPENDED all FS operations and tried to rejoin the
     cluster, it all could work (moreover, the connection to the disk system
     was intact, so it could close the file system gracefully).
      
     It reveals 3 problems at once:
     - single heartbeat link (instead of multiple links);
     - timeout too short (Ethernet can't guarantee 10 seconds, it can
       guarantee 1 minute minimum);
     - fencing even if the system is passive and can remount / reconnect
       instead of rebooting.
      
     All we did in the lab was _disconnect 1 of the trunks between switches
     for a few seconds, then insert it back into the socket_. No other
     application failed (including heartbeat clusters). The database cluster
     was not doing anything on OCFS at the time of the failure (not even
     backups).
      
     I will try heartbeat between loopback interfaces (and the OCFS protocol)
     next time (I am just curious whether it can provide 10 seconds for
     network reconfiguration).
      
     ...
     Feb  1 12:19:13 testrac12 kernel: o2net: connection to node testrac11 (num 0) at 10.254.32.111:7777 has been idle for 10 seconds, shutting it down.
     Feb  1 12:19:13 testrac12 kernel: (13,3):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1170361135.521061 now 1170361145.520476 dr 1170361141.852795 adv 1170361135.521063:1170361135.521064 func (c4378452:505) 1170361067.762941:1170361067.762967)
     Feb  1 12:19:13 testrac12 kernel: o2net: no longer connected to node testrac11 (num 0) at 10.254.32.111:7777
     Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_send_remote_convert_request:398 ERROR: status = -107
     Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of node 0
     Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_send_remote_convert_request:398 ERROR: status = -107
     Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of node 0
     Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_send_remote_convert_request:398 ERROR: status = -107
     Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_wait_for_node_death:371 5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of node 0
     Feb  1 12:22:27 testrac12 kernel: (13,3):o2quo_make_decision:144 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
     Feb  1 12:22:27 testrac12 kernel: (13,3):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
     Feb  1 12:22:27 testrac12 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
     Feb  1 12:22:27 testrac12 kernel:
     Feb  1 12:22:28 testrac12 su: pam_unix2: session finished for user oracle, service su
     Feb  1 12:22:29 testrac12 logger: Oracle CSSD failure.  Rebooting for cluster integrity.
     Feb  1 12:22:32 testrac12 su: pam_unix2: session finished for user oracle, service su
...
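
     The o2quo_make_decision message in the log describes a tie-break rule:
     a node that can reach exactly half of the live nodes survives only if
     its half contains the lowest-numbered node. The Python sketch below is
     a reading of that log message, not a copy of the kernel code; the
     function name and the node number 1 for testrac12 are assumptions.

```python
def should_self_fence(reachable, heartbeating):
    """Decide whether this node should fence itself.

    heartbeating: set of node numbers still alive according to the disk
                  heartbeat (this node included).
    reachable:    subset of those nodes this node can still talk to over
                  the network (this node included).
    """
    connected = len(reachable)
    total = len(heartbeating)
    if 2 * connected > total:
        return False  # clear majority: stay up
    if 2 * connected == total:
        # Exact half: only the half holding the lowest-numbered node survives.
        return min(heartbeating) not in reachable
    return True  # minority: fence


# Scenario from the log: two nodes, network link lost, both still heartbeating.
# Node number 1 for testrac12 is an assumption; the log only names node 0.
print(should_self_fence(reachable={1}, heartbeating={0, 1}))  # True
print(should_self_fence(reachable={0}, heartbeating={0, 1}))  # False
```

     Under this rule a two-node cluster that loses its only network link
     always fences the higher-numbered node, whether or not that node was
     doing any file system work at the time.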
_______________________________________________
Ocfs2-users      mailing      list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
      


 
