On 19/09/2016 19:06, Marco Marino wrote:
2016-09-19 10:50 GMT+02:00 Igor Cicimov
On 19 Sep 2016 5:45 pm, "Marco Marino" <marino....@gmail.com
> Hi, I'm trying to build an active/passive cluster with DRBD and
Pacemaker for a SAN. I'm using 2 nodes, each with one RAID controller
(MegaRAID). Each node has an SSD disk that works as a cache for reads
(and writes?) using the proprietary CacheCade technology.
Did you configure the CacheCade? If the write cache was enabled in
write-back mode, then suddenly removing the device from under the
controller would have caused serious problems, I guess, since the
controller expects to write to the SSD cache first and then flush
to the HDDs. Maybe this explains the read-only mode?
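If MegaCli is installed, the controller's current cache policy can be inspected and, if needed, forced to write-through, roughly like this (exact flag spelling varies between MegaCli versions, so treat it as a sketch to adapt):

```
# Show the cache policy of all logical drives on all adapters
MegaCli64 -LDGetProp -Cache -LALL -aALL

# Switch all logical drives to write-through, so a lost SSD cache
# device cannot hold dirty data that never reached the HDDs
MegaCli64 -LDSetProp -WT -LALL -aALL
```

With write-through you lose the write-cache performance benefit, but pulling the SSD can no longer strand unwritten data.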
Good point. It is exactly as you wrote. How can I mitigate this
behavior in a clustered (active/passive) environment? As I said in
the other post, I think the best solution is to power off the node
using the local-io-error handler and switch all resources to the other
node. But please give me some suggestions.
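A minimal sketch of that power-off approach in the DRBD configuration, assuming a resource named r0 (adjust names to your setup); on-io-error and the local-io-error handler are documented in the DRBD user's guide, and the sysrq write forces an immediate power-off so the peer takes over:

```
resource r0 {
  disk {
    # Run the local-io-error handler when the backing device fails
    on-io-error call-local-io-error;
  }
  handlers {
    # Immediate, unclean power-off of this node; the surviving node
    # is then promoted and takes over the iSCSI target
    local-io-error "echo o > /proc/sysrq-trigger";
  }
}
```

Note this only fires if the I/O error is actually visible to DRBD at its backing device, which is exactly the point in question below.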
I think you will need to examine the logs to find out what happened. It
would appear (just making a wild guess) that the failure surfaced
between DRBD and iSCSI instead of between the RAID controller and DRBD.
If it had happened under DRBD, then DRBD should have seen the read/write
errors and automatically failed the local storage. That wouldn't
necessarily fail the resources over to the secondary, but all reads and
writes would then be served from the secondary node's disk. The fact
this didn't happen makes it look like the failure happened above DRBD.
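That "fail the local storage but keep serving I/O from the peer" behavior is DRBD's detach policy; a minimal fragment (the resource name r0 is a placeholder):

```
resource r0 {
  disk {
    # On a lower-level I/O error, drop the backing device and go
    # Diskless; reads and writes are transparently served by the
    # peer node over the replication link
    on-io-error detach;
  }
}
```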
> Basically, the structure of the SAN is:
> Physical disks -> RAID -> device /dev/sdb in the OS -> DRBD
resource (using /dev/sdb as its backing device, managed by Pacemaker
as a master/slave resource) -> VG (managed with Pacemaker) -> iSCSI
target (with Pacemaker) -> iSCSI LUNs (one for each logical volume
in the VG, managed with Pacemaker)
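For reference, such a stack usually looks roughly like the following in crmsh syntax (all resource names, the r0 resource, the VG name, IQN and LV paths here are assumptions, and monitor operations/timeouts are omitted for brevity):

```
primitive p_drbd_r0 ocf:linbit:drbd \
    params drbd_resource=r0
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 clone-max=2 notify=true
primitive p_lvm_san ocf:heartbeat:LVM \
    params volgrpname=vg_san
primitive p_iscsi_tgt ocf:heartbeat:iSCSITarget \
    params iqn=iqn.2016-09.example:san
primitive p_lun1 ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn=iqn.2016-09.example:san lun=1 path=/dev/vg_san/lv1
group g_san p_lvm_san p_iscsi_tgt p_lun1
colocation c_san_on_drbd inf: g_san ms_drbd_r0:Master
order o_drbd_before_san inf: ms_drbd_r0:promote g_san:start
```

The colocation and order constraints ensure the LVM/iSCSI group only runs where DRBD is Master, and only after promotion.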
> A few days ago, the SSD disk was wrongly removed from the primary
node of the cluster and this caused a lot of problems: the DRBD
resource and all logical volumes went into read-only mode with a lot
of I/O errors, but the cluster did not switch to the other node.
All filesystems on the initiators went into read-only mode. There are
two problems involved here (I think): 1) Why does removing the SSD disk
cause read-only mode with I/O errors? This means that the SSD is
a single point of failure for a single-node SAN with MegaRAID
controllers and CacheCade technology..... and 2) Why did DRBD not
work as expected?
What was the state in /proc/drbd ?
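For comparison, if DRBD had detached its backing device after the error, you would expect a disk state like this in /proc/drbd (illustrative only; connection state and roles depend on your setup):

```
0: cs:Connected ro:Primary/Secondary ds:Diskless/UpToDate C r-----
```

If ds: still showed UpToDate/UpToDate on the primary, DRBD never saw the I/O error, which would support the guess that the failure happened above it.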
At least that is my understanding of how it will work in that scenario.
drbd-user mailing list