    > Hi, I'm trying to build an active/passive cluster with drbd and
    pacemaker for a san. I'm using 2 nodes with one raid controller
    (megaraid) on each one. Each node has an ssd disk that works as
    cache for read (and write?) realizing the CacheCade proprietary
    Did you configure the CacheCade? If the write cache was enabled in
    write-back mode then suddenly removing the device from under the
    controller would have caused serious problems I guess since the
    controller expects to write to the ssd cache first and then flush
    to the hdd's. Maybe this explains the read only mode?

Good point. It is exactly as you wrote. How can I mitigate this behavior in a clustered (active/passive) environment? As I said in the other post, I think the best solution is to power off the node through the local-io-error handler and switch all resources to the other node. Any suggestions would be welcome.
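
To make the idea concrete, this is roughly what I have in mind for the DRBD side. It is only a sketch: the resource name r0 is a placeholder, and the handler command is modeled on the sample drbd.conf shipped with DRBD 8.x, so the script paths may differ on other distributions. The alternative setting "on-io-error detach;" would keep the node up and serve I/O from the peer instead of powering off.

```
resource r0 {                       # placeholder resource name
  disk {
    # On a lower-level I/O error, run the local-io-error handler
    # instead of passing the error up or detaching the backing device.
    on-io-error call-local-io-error;
  }
  handlers {
    # Assumption: notify scripts installed under /usr/lib/drbd.
    # Powers the node off so the cluster can promote the peer.
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
  }
}
```

This only helps if the I/O error is actually reported to DRBD by the backing device; Pacemaker still has to notice the dead node (ideally with fencing configured) before it moves the resources.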

    > Basically, the structure of the san is:
    > Physical disks -> RAID -> Device /dev/sdb in the OS -> Drbd
    resource (that uses /dev/sdb as backend) (using pacemaker with a
    master/slave resource) -> VG (managed with pacemaker) -> iSCSI
    target (with pacemaker) -> iSCSI LUNs (one for each logical volume
    in the VG, managed with pacemaker)
    > A few days ago, the ssd disk was wrongly removed from the primary
    node of the cluster and this caused a lot of problems: the drbd
    resource and all logical volumes went into read-only mode with a lot
    of I/O errors, but the cluster did not switch to the other node.
    All filesystems on the initiators went into read-only mode. There are 2
    problems involved here (I think): 1) Why did removing the ssd disk
    cause read-only mode with I/O errors? This means that the ssd is
    a single point of failure for a single-node san with megaraid
    controllers and CacheCade technology, and 2) Why did drbd not
    work as expected?
    What was the state in /proc/drbd ?

I think you will need to examine the logs to find out what happened. It would appear (just making a wild guess) that the caching is happening between DRBD and iSCSI instead of between DRBD and RAID. If the failure had happened underneath DRBD, then DRBD should have seen the read/write errors and should have automatically failed (detached) the local storage. It wouldn't necessarily fail over to the secondary, but it would serve all reads and writes from the secondary node. The fact that this didn't happen makes it look like the failure happened above DRBD.

At least that is my understanding of how it will work in that scenario.
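
To check which of the two cases you hit, the disk state in /proc/drbd is the thing to look at: if DRBD detached the backing device, the local disk state reads Diskless. Below is a minimal Python sketch that pulls those fields out of DRBD 8.x style /proc/drbd output. The function names and the sample text are made up for illustration, and the regular expression assumes the usual "cs:/ro:/ds:" status line layout.

```python
import re

# Matches DRBD 8.x status lines such as:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<ro_local>[^/\s]+)/(?P<ro_peer>\S+)"
    r"\s+ds:(?P<ds_local>[^/\s]+)/(?P<ds_peer>\S+)"
)


def parse_proc_drbd(text):
    """Return one dict per configured DRBD device found in the text."""
    resources = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            info = match.groupdict()
            info["minor"] = int(info["minor"])
            resources.append(info)
    return resources


def local_disk_detached(resource):
    """True if DRBD has dropped its local backing device."""
    return resource["ds_local"] == "Diskless"


# Hypothetical sample: what a primary might show after detaching its disk.
SAMPLE = """\
version: 8.4.3 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
"""

if __name__ == "__main__":
    # On a real node, read the text from /proc/drbd instead of SAMPLE.
    for res in parse_proc_drbd(SAMPLE):
        state = "detached" if local_disk_detached(res) else "attached"
        print(f"minor {res['minor']}: cs={res['cs']} local disk {state}")
```

If the local disk state still showed UpToDate while the filesystems were throwing errors, that would support the guess that DRBD never saw the failure.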

