Hi all,

I'm looking for some input from the experts!

Short story
-----------
Zeroing my DRBD device before using it turns a non-working system into a
working one, and I'm trying to figure out why. I'm also trying to
understand whether I will run into other problems down the road.

Long story
----------
I am building a pair of redundant iSCSI targets for VMware ESX 4.1, using
the following software components:
- Fedora 12 x86_64
- DRBD 8.3.8.1
- pacemaker 1.0.9
- corosync 1.2.8
- SCST iSCSI Target (using SVN trunk, almost 2.0)

SCST isn't cluster-aware, so I'm using DRBD in primary/secondary mode.
I'm creating two iSCSI targets, one on each node, with mutual failover
and no multipath. As a reference for the discussion, I'm attaching my
resource agent, my CIB and my DRBD config files. The resource agent is a
modification of iSCSITarget/iSCSILun with some SCST specifics.
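
To give an idea of what the agent does without opening the attachment,
here is a rough sketch of its start logic, assuming the SCST 2.0 sysfs
interface; the device name 'lun_a' and the exact paths are illustrative,
not copied from the actual agent:

        # register the DRBD device with the BLOCKIO vdisk handler
        echo "add_device lun_a filename=/dev/drbd0" \
                > /sys/kernel/scst_tgt/handlers/vdisk_blockio/mgmt
        # create the iSCSI target and map the device as LUN 0
        echo "add_target iqn.2004-04.local.scst:ta" \
                > /sys/kernel/scst_tgt/targets/iscsi/mgmt
        echo "add lun_a 0" \
                > /sys/kernel/scst_tgt/targets/iscsi/iqn.2004-04.local.scst:ta/luns/mgmt
        # enable the target and the iSCSI driver
        echo 1 > /sys/kernel/scst_tgt/targets/iscsi/iqn.2004-04.local.scst:ta/enabled
        echo 1 > /sys/kernel/scst_tgt/targets/iscsi/enabled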

When running this setup on a pair of physical hosts, everything works
fine. However, my interest is in small setups and I want to run the two
targets in VMs, hosted on the ESX hosts that will be the iSCSI
initiators. The market calls this a virtual SAN... I know, I know, this
is not recommended, but commercial solutions of this kind do exist, and
it makes a lot of sense for small setups. I'm not looking for
performance, but for high availability.

That said, I have two ways to present physical disk space to DRBD (it
shows up as /dev/sdb and /dev/sdc in the VMs):

1) Map raid volumes to the Fedora VMs using RDM (Raw Device Mapping)
2) Format the raid volumes with VMFS, and create virtual disks (VMDKs)
in that datastore for the Fedora VMs.

Option 1) obviously works better, but is not always possible (many
restrictions on RAID controllers, for instance).
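
On the ESX side, option 1) boils down to creating an RDM pointer file
for the raw LUN, along these lines (the device identifier and datastore
path are placeholders):

        # physical-compatibility RDM pointing at the raw LUN, stored in
        # a VMFS datastore and then attached to the Fedora VM
        vmkfstools -z /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx \
                /vmfs/volumes/datastore1/tstore-a/tstore-a_rdm.vmdk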

Option 2) works fine until I put iSCSI write load on the Fedora VM. With
large blocks, I quickly end up with stalled VMs: the iSCSI target
complains that the backend device doesn't respond, and the kernel
reports 120-second task timeouts for the DRBD threads. The DRBD backend
devices appear dead. At this stage there is no iSCSI traffic anymore,
CPU usage is nil, memory is fine... pure I/O starvation. Rebooting the
Fedora VM solves the problem. From a DRBD/SCST point of view, it looks
as if the backend hardware were failing, yet the physical disks/arrays
are fine. The problem is clearly within VMware.

One of VMware's recommendations is to create the large VMDKs as
'eagerZeroedThick', which zeroes the entire disk at creation time. This
helps, but doesn't solve the problem completely.
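
For the record, this can be done at VMDK creation time from the command
line, e.g. (size and path are examples):

        # allocate and zero the whole disk up front instead of zeroing
        # lazily on first write
        vmkfstools -c 200G -d eagerzeroedthick \
                /vmfs/volumes/datastore1/tstore-a/tstore-a_1.vmdk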

I then tried a third option: format /dev/drbd0 with XFS, create one BIG
file (using dd) on that filesystem, and export this file via iSCSI/SCST
(instead of exporting the /dev/drbd0 block device directly). I couldn't
crash this setup, but I don't like the idea of having a single 200G file
on a 99% full filesystem.
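
For completeness, option 3 looks roughly like this (size, paths and the
file name are examples); the file is then exported through SCST's FILEIO
handler rather than exporting the block device directly:

        # format the DRBD device and pre-allocate one large backing file
        mkfs.xfs /dev/drbd0
        mkdir -p /srv/iscsi
        mount /dev/drbd0 /srv/iscsi
        # ~200G, fully written out rather than sparse
        dd if=/dev/zero of=/srv/iscsi/lun_a.img bs=1M count=204800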

This brought me to option 4: I directly export /dev/drbd0 via SCST (same
as options 1 and 2), but before using it, I issue:

        dd if=/dev/zero of=/dev/drbd0 bs=4096
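
(dd simply stops with a "no space left on device" error once the whole
device has been written; that is expected.) An equivalent variant with
larger blocks and direct I/O should also work:

        # same idea, larger blocks, bypassing the page cache
        dd if=/dev/zero of=/dev/drbd0 bs=1M oflag=direct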

I have now been running this setup for two weeks, putting as much load
on it as I can (mainly with dd, bonnie++, DiskTT and VMware Storage
vMotion). The only issue I have faced is that the Pacemaker 'monitor'
action on DRBD sometimes takes more than 20 seconds, so I have increased
that timeout to 60s. Since then, no problems at all!
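
One way to make that change is with the crm shell (the resulting
timeouts are visible in the CIB below):

        # open the configuration in $EDITOR and bump the timeouts, so
        # that the DRBD monitor operations read:
        #   op monitor interval="29s" role="Master" timeout="60s"
        #   op monitor interval="30s" role="Slave" timeout="60s"
        crm configure edit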

As you can imagine, I'm pretty happy with the setup, but I still don't
fully understand why it now works. I hate these situations...

Can zeroing make such a big difference? Does it just make a difference
at the RAID/disk level, or does it also make a difference at the DRBD
level?

Sorry for the long e-mail, and thanks a ton for any input. - Patrick -

PS: Based on my reading, many people are trying to build this kind of
solution. XtraVirt had a VM for this at some point, but not anymore.
People are trying to do it with OpenFiler, but IET and VMware don't like
each other. My setup is not documented the way it should be, but I'm
happy to share it if anyone wants to play with it.


node tstore-a.labo.local \
        attributes standby="off"
node tstore-b.labo.local \
        attributes standby="off"
primitive res_drbd_a ocf:linbit:drbd \
        params drbd_resource="drbd_a" \
        op monitor interval="29s" role="Master" timeout="60s" \
        op monitor interval="30s" role="Slave" timeout="60s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive res_drbd_b ocf:linbit:drbd \
        params drbd_resource="drbd_b" \
        op monitor interval="29s" role="Master" timeout="60s" \
        op monitor interval="30s" role="Slave" timeout="60s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive res_ip_a ocf:heartbeat:IPaddr2 \
        params ip="10.0.0.241" \
        op monitor interval="10s"
primitive res_ip_b ocf:heartbeat:IPaddr2 \
        params ip="10.0.0.242" \
        op monitor interval="10s"
primitive res_scst_a ocf:heartbeat:scstlun \
        params iqn="iqn.2004-04.local.scst:ta" path="/dev/drbd0" \
        op monitor interval="10s" \
        op stop interval="0" timeout="60s"
primitive res_scst_b ocf:heartbeat:scstlun \
        params iqn="iqn.2004-04.local.scst:tb" path="/dev/drbd1" \
        op monitor interval="10s" \
        op stop interval="0" timeout="60s"
group grp_iscsi_a res_scst_a res_ip_a \
        meta target-role="Started"
group grp_iscsi_b res_scst_b res_ip_b \
        meta target-role="Started"
ms ms_drbd_a res_drbd_a \
        meta clone-max="2" notify="true"
ms ms_drbd_b res_drbd_b \
        meta clone-max="2" notify="true"
location loc_drbd_a-master ms_drbd_a \
        rule $id="loc_drbd_a-master-rule" $role="master" 1000: #uname eq tstore-a.labo.local
location loc_drbd_b-master ms_drbd_b \
        rule $id="loc_drbd_b-master-rule" $role="master" 1000: #uname eq tstore-b.labo.local
colocation col_grp_iscsi_a_on_drbd_a_master inf: grp_iscsi_a ms_drbd_a:Master
colocation col_grp_iscsi_b_on_drbd_b_master inf: grp_iscsi_b ms_drbd_b:Master
order ord_drbd_a_before_grp_iscsi_a inf: ms_drbd_a:promote grp_iscsi_a:start
order ord_drbd_b_before_grp_iscsi_b inf: ms_drbd_b:promote grp_iscsi_b:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1284912732"
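
For those who don't want to open the attachments, a minimal drbd_a.res
along these lines would match the setup above (the replication addresses
and the backing device are placeholders, not my actual values):

resource drbd_a {
        device    /dev/drbd0;
        disk      /dev/sdb;         # the VMDK or RDM seen inside the VM
        meta-disk internal;
        on tstore-a.labo.local { address 192.168.10.1:7788; }
        on tstore-b.labo.local { address 192.168.10.2:7788; }
}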

Attachments: scstlun (the resource agent), drbd_a.res, drbd_b.res, global_common.conf
