[Linux-ha-dev] Re: [Linux-HA] kernel panic on disk failure

Nate Reed Tue, 31 Jan 2006 06:57:50 -0800

On Tuesday 31 January 2006 07:58, Todd Denniston wrote:

> Both machines panic-ed?
> or did one panic and the other halt with a message of "!DRBD! pri on
> incon-degr"?


That's possible... I don't remember the exact error message but perhaps you're 
right that it was not a panic but it did halt.

> if in your drbd.conf disk{} section you have
> on-io-error panic;
> I would expect the node which lost its disk to panic and the other to keep
> on going. [I count on this with a couple of mildly reliable Promise RM8000
> arrays, and have seen it do its job.]

Interesting.... We have on-io-error detach.  Should we change that to "panic"?

>
> As Alan said, set the ko-count and on-disconnect values in the net{}
> section of drbd.conf, and include your conf in the next email so we know
> what you are set to.

Here's our drbd.conf & ha.cf attached (some things have been tweaked since the 
test I described, but nothing major).

> BTW which protocol are you using?

Protocol C.

Thank you,
Nate


# $Id: ha.cf.in 7546 2006-01-30 22:23:50Z nreed $
#
#       There are lots of options in this file.  All you have to have is a set
#       of nodes listed {"node ..."}, one of {serial, bcast, mcast, or ucast},
#       and a value for "auto_failback".
#
#       ATTENTION: As the configuration file is read line by line,
#                  THE ORDER OF DIRECTIVE MATTERS!
#
#       In particular, make sure that the udpport, serial baud rate
#       etc. are set before the heartbeat media are defined!
#       debug and log file directives go into effect when they
#       are encountered.
#
#       All will be fine if you keep them ordered as in this example.
#
#
#       Note on logging:
#       If any of debugfile, logfile and logfacility are defined then they
#       will be used. If debugfile and/or logfile are not defined and
#       logfacility is defined then the respective logging and debug
#       messages will be logged to syslog. If logfacility is not defined
#       then debugfile and logfile will be used to log messages. If
#       logfacility is not defined and debugfile and/or logfile are not
#       defined then defaults will be used for debugfile and logfile as
#       required and messages will be sent there.
#
#       File to write debug messages to
#debugfile /var/log/ha-debug
#
#
#       File to write other messages to
#
#logfile        /var/log/ha-log
#
#
#       Facility to use for syslog()/logger 
#
#logfacility local7    
#
#
#       A note on specifying "how long" times below...
#
#       The default time unit is seconds
#               10 means ten seconds
#
#       You can also specify them in milliseconds
#               1500ms means 1.5 seconds
#
#
#       keepalive: how long between heartbeats?
#
keepalive 2
#
#       deadtime: how long-to-declare-host-dead?
#
#               If you set this too low you will get the problematic
#               split-brain (or cluster partition) problem.
#               See the FAQ for how to use warntime to tune deadtime.
#
deadtime 30
#
#       warntime: how long before issuing "late heartbeat" warning?
#       See the FAQ for how to use warntime to tune deadtime.
#
warntime 5 
#
#
#       Very first dead time (initdead)
#
#       On some machines/OSes, etc. the network takes a while to come up
#       and start working right after you've been rebooted.  As a result
#       we have a separate dead time for when things first come up.
#       It should be at least twice the normal dead time.
#
initdead 120
#
#
#       What UDP port to use for bcast/ucast communication?
#
udpport 694
#
#       Baud rate for serial ports...
#
#baud   19200
#       
#       serial  serialportname ...
#serial /dev/ttyS0      # Linux
#serial /dev/cuaa0      # FreeBSD
#serial /dev/cua/a      # Solaris
#
#
#       What interfaces to broadcast heartbeats over?
#
#bcast  eth0            # Linux
bcast   eth2    # Linux
#bcast  le0             # Solaris
#bcast  le1 le2         # Solaris
#
#       Set up a multicast heartbeat medium
#       mcast [dev] [mcast group] [port] [ttl] [loop]
#
#       [dev]           device to send/rcv heartbeats on
#       [mcast group]   multicast group to join (class D multicast address
#                       224.0.0.0 - 239.255.255.255)
#       [port]          udp port to sendto/rcvfrom (set this value to the
#                       same value as "udpport" above)
#       [ttl]           the ttl value for outbound heartbeats.  this affects
#                       how far the multicast packet will propagate.  (0-255)
#                       Must be greater than zero.
#       [loop]          toggles loopback for outbound multicast heartbeats.
#                       if enabled, an outbound packet will be looped back and
#                       received by the interface it was sent on. (0 or 1)
#                       Set this value to zero.
#               
#
#mcast eth0 225.0.0.1 694 1 0
#
#       Set up a unicast / udp heartbeat medium
#       ucast [dev] [peer-ip-addr]
#
#       [dev]           device to send/rcv heartbeats on
#       [peer-ip-addr]  IP address of peer to send packets to
#
#ucast eth0 192.168.1.2
#
#
#       About boolean values...
#
#       Any of the following case-insensitive values will work for true:
#               true, on, yes, y, 1
#       Any of the following case-insensitive values will work for false:
#               false, off, no, n, 0
#
#
#
#       auto_failback:  determines whether a resource will
#       automatically fail back to its "primary" node, or remain
#       on whatever node is serving it until that node fails, or
#       an administrator intervenes.
#
#       The possible values for auto_failback are:
#               on      - enable automatic failbacks
#               off     - disable automatic failbacks
#               legacy  - enable automatic failbacks in systems
#                       where all nodes do not yet support
#                       the auto_failback option.
#
#       auto_failback "on" and "off" are backwards compatible with the old
#               "nice_failback on" setting.
#
#       See the FAQ for information on how to convert
#               from "legacy" to "on" without a flash cut.
#               (i.e., using a "rolling upgrade" process)
#
#       The default value for auto_failback is "legacy", which
#       will issue a warning at startup.  So, make sure you put
#       an auto_failback directive in your ha.cf file.
#       (note: auto_failback can be any boolean or "legacy")
#
auto_failback off
#
#
#       Basic STONITH support
#       Using this directive assumes that there is one stonith 
#       device in the cluster.  Parameters to this device are 
#       read from a configuration file. The format of this line is:
#
#         stonith <stonith_type> <configfile>
#
#       NOTE: it is up to you to maintain this file on each node in the
#       cluster!
#
#stonith baytech /etc/ha.d/conf/stonith.baytech
#
#       STONITH support
#       You can configure multiple stonith devices using this directive.
#       The format of the line is:
#         stonith_host <hostfrom> <stonith_type> <params...>
#         <hostfrom> is the machine the stonith device is attached
#              to or * to mean it is accessible from any host. 
#         <stonith_type> is the type of stonith device (a list of
#              supported drivers is in /usr/lib/stonith.)
#         <params...> are driver specific parameters.  To see the
#              format for a particular device, run:
#           stonith -l -t <stonith_type> 
#
#
#       Note that if you put your stonith device access information in
#       here, and you make this file publicly readable, you're asking
#       for a denial of service attack ;-)
#
#       To get a list of supported stonith devices, run
#               stonith -L
#       For detailed information on which stonith devices are supported
#       and their detailed configuration options, run this command:
#               stonith -h
#
#stonith_host *     baytech 10.0.0.3 mylogin mysecretpassword
#stonith_host ken3  rps10 /dev/ttyS1 kathy 0 
#stonith_host kathy rps10 /dev/ttyS1 ken3 0 
stonith_host * wti_nps @POWER_SWITCH_IPADDR@ @POWER_SWITCH_PASSWORD@
#
#       Watchdog is the watchdog timer.  If our own heart doesn't beat for
#       a minute, then our machine will reboot.
#       NOTE: If you are using the software watchdog, you very likely
#       wish to load the module with the parameter "nowayout=0" or
#       compile it without CONFIG_WATCHDOG_NOWAYOUT set. Otherwise even
#       an orderly shutdown of heartbeat will trigger a reboot, which is
#       very likely NOT what you want.
#
#watchdog /dev/watchdog
#       
#       Tell what machines are in the cluster
#       node    nodename ...    -- must match uname -n
node    node1.cluster1.test.awarix.com
node    node2.cluster1.test.awarix.com
#
#       Less common options...
#
#       Treats 10.10.10.254 as a pseudo-cluster-member
#       Used together with ipfail below...
#
ping @PING_ROUTER@
#
#       Treats 10.10.10.254 and 10.10.10.253 as a pseudo-cluster-member
#       called group1. If either 10.10.10.254 or 10.10.10.253 is up
#       then group1 is up
#       Used together with ipfail below...
#
#ping_group group1 10.10.10.254 10.10.10.253
#
#       HBA ping directive for Fiber Channel
#       Treats fc-card-name as a pseudo-cluster-member
#       Used together with ipfail below...
#
#       You can obtain HBAAPI from http://hbaapi.sourceforge.net.  You need
#       to get the library specific to your HBA directly from the vendor.
#       To install HBAAPI, all you need to do is compile the common
#       part you obtained from SourceForge. This will produce libHBAAPI.so,
#       which you need to copy to /usr/lib. You also need to copy hbaapi.h to
#       /usr/include.
#
#       The fc-card-name is the name obtained from the hbaapitest program
#       that is part of the hbaapi package. Running hbaapitest will produce
#       verbose output. One of the first lines is similar to:
#               Apapter number 0 is named: qlogic-qla2200-0
#       Here fc-card-name is qlogic-qla2200-0.  
#
#hbaping fc-card-name
#
#
#       Processes started and stopped with heartbeat.  Restarted unless
#               they exit with rc=100
#
#respawn userid /path/name/to/run
# Only use this with R1.x style clusters:
respawn hacluster /usr/lib64/heartbeat/ipfail
#
#       Access control for client api
#               default is no access
#
#apiauth client-name gid=gidlist uid=uidlist
#apiauth ipfail gid=haclient uid=hacluster

###########################
#
#       Unusual options.
#
###########################
#
#       hopfudge maximum hop count minus number of nodes in config
#hopfudge 1
#
#       deadping - dead time for ping nodes
deadping 30
#
#       hbgenmethod - Heartbeat generation number creation method
#               Normally these are stored on disk and incremented as needed.
#hbgenmethod time
#
#       realtime - enable/disable realtime execution (high priority, etc.)
#               defaults to on
#realtime off
#
#       debug - set debug level
#               defaults to zero
#debug 1
#
#       API Authentication - replaces the fifo-permissions-based system of the past
#
#
#       You can put a uid list and/or a gid list.
#       If you put both, then a process is authorized if it qualifies under either
#       the uid list, or under the gid list.
#
#       The groupname "default" has special meaning.  If it is specified, then
#       this will be used for authorizing groupless clients, and any client groups
#       not otherwise specified.
#       
#       There is a subtle exception to this.  "default" will never be used in the
#       following cases (actual default auth directives noted in brackets)
#                 ipfail        (uid=HA_CCMUSER)
#                 ccm           (uid=HA_CCMUSER)
#                 ping          (gid=HA_APIGROUP)
#                 cl_status     (gid=HA_APIGROUP)
#
#       This is done to avoid creating a gaping security hole and matches the
#       most likely desired configuration.
#
#apiauth ipfail uid=hacluster
#apiauth ccm uid=hacluster
#apiauth cms uid=hacluster
#apiauth ping gid=haclient uid=alanr,root
#apiauth default gid=haclient

#       message format on the wire; it can be classic or netstring,
#       default: classic
#msgfmt  classic/netstring

#       do we use logging daemon?
#       detail policy:
#       1. if there is any entry for debugfile/logfile/logfacility in ha.cf
#        a) if use_logd is not set, logging daemon will not be used
#        b) if use_logd is set to on, logging daemon will be used
#        c) if use_logd is set to off, logging daemon will not be used
#
#
#       2. if there is no entry for debugfile/logfile/logfacility in ha.cf
#        a) if use_logd is not set, logging daemon will be used 
#        b) if use_logd is set to on, logging daemon will be used
#        c) if use_logd is set to off, config error, i.e. you can not turn
#           off all logging options
#       
#       If logging daemon is used, logfile/debugfile/logfacility in this file
#       are not meaningful any longer. You should check the config file for the
#       logging daemon (the default is /etc/logd.cf)
#
#       If you are not sure about this option, don't configure it
#
# use_logd yes/no
use_logd yes

#
#       the interval at which we reconnect to the logging daemon if the
#       previous connection failed
#       default: 60 seconds
#conn_logd_time 60
#
#
#       Configure compression module
#       It could be zlib or bz2, depending on whether you have the corresponding
#       library in the system.
compression     bz2
#
#       Configure compression threshold
#       This value determines the threshold to compress a message,
#       e.g. if the threshold is 1, then any message with size greater than 1 KB
#       will be compressed, the default is 2 (KB)
compression_threshold 2

# Enable Heartbeat to do core dumps
coredumps true

# Enable Cluster Resource Manager
crm no 

# Increase the real-time priority of Heartbeat 
# This should eliminate late heartbeat messages in the logs
rtprio 99
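
The timing rules described in the ha.cf comments above (the default unit is seconds, an "ms" suffix means milliseconds, and initdead should be at least twice deadtime) can be sanity-checked with a short script. This is only a sketch: `parse_ha_time` and `check_timings` are hypothetical helper names, not part of Heartbeat, and the ordering check keepalive < warntime < deadtime is an assumption rather than something the file states.

```python
def parse_ha_time(value: str) -> float:
    """Convert an ha.cf time value to seconds.

    Plain numbers are seconds; a trailing "ms" means milliseconds.
    """
    text = value.strip().lower()
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    return float(text)


def check_timings(keepalive: str, warntime: str, deadtime: str, initdead: str) -> list:
    """Return a list of warnings; an empty list means the checks passed."""
    ka, warn, dead, init = (
        parse_ha_time(v) for v in (keepalive, warntime, deadtime, initdead)
    )
    problems = []
    # Assumption (not stated in ha.cf): the three values should be ordered.
    if not ka < warn < dead:
        problems.append("expected keepalive < warntime < deadtime")
    # Stated in the initdead comment: at least twice the normal dead time.
    if init < 2 * dead:
        problems.append("initdead should be at least twice deadtime")
    return problems


if __name__ == "__main__":
    # Values taken from the ha.cf above.
    print(check_timings("2", "5", "30", "120"))
```

With the values from this ha.cf (keepalive 2, warntime 5, deadtime 30, initdead 120) both checks pass.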

# $Id: drbd.conf.in 7138 2005-12-30 17:55:40Z nreed $
#
# drbd.conf example
#
# parameters you _need_ to change are the hostname, device, disk,
# meta-disk, address and port in the "on <hostname> {}" sections.
#
# you ought to know about the protocol, and the various timeouts.
#
# you probably want to set the rate in the syncer sections
#
# increase timeout and maybe ping-int in net{}, if you see
# problems with "connection lost/connection established"
# (or change your setup to reduce network latency; make sure full
#  duplex behaves as such; check average roundtrip times while
#  network is saturated; and so on ...)
#

#
# Upgrading from DRBD-0.6.x
#
# Using the size parameter in the disk section (was disk-size) is
# no longer valid. The agreed disk size is now stored
# in DRBD's non volatile meta data files.
#
# NOTE that if you do not have some dedicated partition to use for
# the meta-data, you may use 'internal' meta-data.
#
#       THIS HOWEVER WILL DESTROY THE LAST 128M
#       OF THE LOWER LEVEL DEVICE.
#
# So you better make sure you shrink the filesystem by 128M FIRST!
# or by 132M just to be sure... :)
#

skip {
  As you can see, you can also comment chunks of text
  with a 'skip[optional nonsense]{ skipped text }' section.
  This comes in handy, if you just want to comment out
  some 'resource <some name> {...}' section:
  just precede it with 'skip'.

  The basic format of option assignment is
  <option name><linear whitespace><value>;
  
  It should be obvious from the examples below,
  but if you really care to know the details:
  
  <option name> :=
        valid options in the respective scope
  <value>  := <num>|<string>|<choice>|...
              depending on the set of allowed values
              for the respective option.
  <num>    := [0-9]+, sometimes with an optional suffix of K,M,G
  <string> := (<name>|\"([^\"\\\n]*|\\.)*\")+
  <name>   := [/_.A-Za-z0-9-]+
}

#
# At most ONE global section is allowed.
# It must precede any resource section.
#
# global {
    # use this if you want to define more resources later
    # without reloading the module.
    # by default we load the module with exactly as many devices
    # as are configured in this file.
    #
    # minor_count 5;

    # this is for people who set up a drbd device via the
    # loopback network interface or between two VMs on the same
    # box, for testing/simulating/presentation
    # otherwise it could trigger a run_task_queue deadlock.
    # I'm not sure whether this deadlock can happen with two
    # nodes, but it seems at least extremely unlikely; and since
    # the io_hints boost performance, keep them enabled.
    #
    # With linux 2.6 it no longer makes sense.
    # So this option should vanish.     --lge
    #
    # disable_io_hints
# }

#
# this need not be r#, you may use phony resource names,
# like "resource web" or "resource mail", too
#

resource r0 {

  # transfer protocol to use.
  # C: write IO is reported as completed, if we know it has
  #    reached _both_ local and remote DISK.
  #    * for critical transactional data.
  # B: write IO is reported as completed, if it has reached
  #    local DISK and remote buffer cache.
  #    * for most cases.
  # A: write IO is reported as completed, if it has reached
  #    local DISK and local tcp send buffer. (see also sndbuf-size)
  #    * for high latency networks
  #
  #**********
  # uhm, benchmarks have shown that C is actually better than B.
  # this note shall disappear, when we are convinced that B is
  # the right choice "for most cases".
  # Until then, always use C unless you have a reason not to.
  #     --lge
  #**********
  #
  protocol C;

  # what should be done in case the cluster starts up in
  # degraded mode, but knows it has inconsistent data.
  incon-degr-cmd "halt -f";

  startup {
    # Wait for connection timeout. 
    # The init script blocks the boot process until the resources
    # are connected. 
    # In case you want to limit the wait time, do it here.
    #
    # wfc-timeout  0;

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used. 
    #
    degr-wfc-timeout 120;    # 2 minutes.
  }

  disk {
    # if the lower level device reports io-error you have the choice of
    #  "pass_on"  ->  Report the io-error to the upper layers.
    #                 Primary   -> report it to the mounted file system.
    #                 Secondary -> ignore it.
    #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
    #  "detach"   ->  The node drops its backing storage device, and
    #                 continues in diskless mode.
    #
    on-io-error   detach;
  }

  net {
    # this is the size of the tcp socket send buffer
    # increase it _carefully_ if you want to use protocol A over a
    # high latency network with reasonable write throughput.
    # defaults to 2*65535; you might try even 1M, but if your kernel or
    # network driver chokes on that, you have been warned.
    sndbuf-size 512k;

    timeout       60;    #  6 seconds  (unit = 0.1 seconds)
    connect-int   10;    # 10 seconds  (unit = 1 second)
    ping-int      10;    # 10 seconds  (unit = 1 second)

    # Maximal number of requests (4K) to be allocated by DRBD.
    # The minimum is hardcoded to 32 (=128 kb).
    # For high-performance installations it might help if you
    # increase that number. These buffers are used to hold
    # datablocks while they are written to disk.
    #
    # max-buffers     2048;

    # The highest number of data blocks between two write barriers. 
    # If you set this < 10 you might decrease your performance.
    # max-epoch-size  2048;

    # if some block send times out this many times, the peer is 
    # considered dead, even if it still answers ping requests.
    # ko-count 4;

    # if the connection to the peer is lost you have the choice of
    #  "reconnect"   -> Try to reconnect (AKA WFConnection state)
    #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
    #  "freeze_io"   -> Try to reconnect but freeze all IO until
    #                   the connection is established again.
    on-disconnect reconnect;

  }

  syncer {
    # Limit the bandwidth used by the resynchronisation process.
    # default unit is KB/sec; optional suffixes K,M,G are allowed
    #
    rate 10M;

    # All devices in one group are resynchronized in parallel.
    # Resynchronisation of groups is serialized in ascending order.
    # Put DRBD resources which are on different physical disks in one group.
    # Put DRBD resources on one physical disk in different groups.
    #
    group 1;

    # Configures the size of the active set. Each extent is 4M, 
    # 257 Extents ~> 1GB active set size. In case your syncer
    # runs @ 10MB/sec, all resync after a primary's crash will last
    # 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.
    # BTW, the hash algorithm works best if the number of al-extents
    # is prime. (To test the worst case performance use a power of 2)
    al-extents 257;
  }

  on node1.cluster1.test.awarix.com {
    device     /dev/drbd0;
    disk       @SHARED_DEVICE@;
    address    169.254.0.1:7788;
    meta-disk  @[EMAIL PROTECTED];

    # meta-disk is either 'internal' or '/dev/ice/name [idx]'
    #
    # You can use a single block device to store meta-data
    # of multiple DRBD's.
    # E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1];
    # for two different resources. In this case the meta-disk
    # would need to be at least 256 MB in size.
    #
    # 'internal' means, that the last 128 MB of the lower device
    # are used to store the meta-data.
    # You must not give an index with 'internal'.
  }

  on node2.cluster1.test.awarix.com {
    device    /dev/drbd0;
    disk      @SHARED_DEVICE@;
    address   169.254.0.2:7788;
    meta-disk @[EMAIL PROTECTED];
  }
}
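
The unit arithmetic in the drbd.conf comments above (4M per activity-log extent, net{} timeout in tenths of a second) can be reproduced with a few lines of Python. This is a rough sketch, not anything DRBD ships: the function names are hypothetical, and it assumes "rate 10M" means roughly 10 MB per second, as the syncer comment suggests.

```python
EXTENT_MIB = 4  # each al-extent covers 4M, per the syncer comment


def active_set_mib(al_extents: int) -> int:
    """Size of the active set in MiB for a given al-extents value."""
    return al_extents * EXTENT_MIB


def resync_seconds(al_extents: int, rate_mib_per_s: float) -> float:
    """Approximate resync time after a primary crash, in seconds."""
    return active_set_mib(al_extents) / rate_mib_per_s


def net_timeout_seconds(timeout: int) -> float:
    """The net{} timeout value is given in units of 0.1 seconds."""
    return timeout / 10


if __name__ == "__main__":
    # Values taken from the drbd.conf above: al-extents 257, rate 10M, timeout 60.
    print(active_set_mib(257), "MiB active set")
    print(resync_seconds(257, 10), "seconds to resync")
    print(net_timeout_seconds(60), "second network timeout")
```

For al-extents 257 this gives an active set of 1028 MiB, which matches the "~> 1GB" and "~ 102 seconds" figures in the comment once rounded.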

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
