So, if I am understanding you right, what you really have is two copies of the filesystem - one on each node. The copy on each node is actually a mirror made up of one local LUN and one remote LUN. Something like this ASCII drawing:

node0          node1
  |  \        /  |
  |   \      /   |
  |      X       |
  |   /      \   |
  |  /        \  |
mirror0      mirror1
  |  \        /  |
  |   \      /   |
  |      X       |
  |   /      \   |
  |  /        \  |
 lun0         lun1

You are effectively doing the same thing, albeit in a more fault-prone manner, as what I suggested - just at the LVM level instead of at the hardware level (dual, direct-attached disk or SAN-attached disk).

I think you are going to run into problems here, particularly with ZFS, since it is not multi-initiator aware and cannot be present on more than one node at a time. I suspect that to do this you are going to need to follow Steve Mckinty's advice and look at Sun Cluster Geographic Edition. I believe Tim Read wrote up a Blueprint about just this sort of scenario.
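If you do end up going the AVS route, the remote mirror piece is driven by sndradm. Very roughly, enabling a synchronous set over IP looks something like the below - this is only a sketch, the device paths and bitmap slices are made up, and the same enable has to be run on both hosts:

    # On node1 (primary): create/enable the replication set.  Here s0 is
    # the data volume and s1 the bitmap volume - both placeholders; every
    # replicated volume needs its own bitmap.
    sndradm -n -e node1 /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 \
                  node2 /dev/rdsk/c1t0d0s0 /dev/rdsk/c1t0d0s1 ip sync

    # Run the same enable on node2, then kick off the initial full sync
    # from node1 and wait for it to finish.
    sndradm -n -m
    sndradm -n -w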
fpsm

On Thu, Mar 11, 2010 at 11:03 AM, Anton Altaparmakov <aia21 at cam.ac.uk> wrote:
> Hi,
>
> Thank you for the quick reply but I am afraid I didn't express myself well
> enough...
>
> On 11 Mar 2010, at 14:43, Fredrich Maney wrote:
>> In order for a filesystem (any filesystem on any OS) to fail over
>> between nodes, that filesystem needs to be on shared storage that is
>> external to all nodes. This is because if the node that hosts the
>> storage fails, i.e. has a system board failure, there is no way for
>> the other node to see it.
>
> No it doesn't... That is not what we do on Linux. The storage is
> replicated on each node.
>
>> You are already doing this in your working example on Linux - the
>> iSCSI LUNs are presented to both nodes in the cluster from whatever
>> device is hosting the iSCSI LUNs.
>
> No. Each node IS the storage in our setup. Here is exactly what we have
> with Linux:
>
> An LVM-provided block device on node1 and an LVM-provided block device
> on node2.
>
> When node1 is master we have:
>
> - node2 exports its block device via iscsi_target
> - node1 imports that block device from node2 via open-iscsi
> - node1 runs Linux software RAID (MD) in synchronous mirror mode between
>   its local block device and the block device imported from node2 over
>   iSCSI
> - node1 mounts the software RAID MD device using XFS
> - node1 runs the NFS server exporting the XFS file system
> - node1 has the service IP address
>
> When node1 fails (or we ask heartbeat to move the service to the other
> node), we:
>
> - stop using the IP address on node1
> - shut down the NFS server on node1
> - unmount the XFS file system on node1
> - stop the RAID device on node1
> - stop importing the iSCSI device on node1
> - node2 stops exporting its block device via iscsi_target
>
> And then we bring everything up again as above but with the roles
> reversed, i.e.:
>
> - node1 exports its block device via iscsi_target
> - node2 imports that block device from node1 via open-iscsi
> - node2 runs Linux software RAID (MD) in synchronous mirror mode between
>   its local block device and the block device imported from node1 over
>   iSCSI
> - node2 mounts the software RAID MD device using XFS
> - node2 runs the NFS server exporting the XFS file system
> - node2 has the service IP address
>
> And all this happens within a matter of seconds so that the NFS
> connections do not even notice the interruption at all. You just get a
> brief pause on the NFS clients and then they carry on as before without
> even knowing that they are now talking to a completely different server.
>
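Just to check we are looking at the same thing: on the Linux side that sequence boils down to roughly the below (only a sketch - the IQN, device names, mount point and addresses are placeholders, not your actual config):

    # --- bring the service up on the master (here node1) ---
    # import node2's LUN via the open-iscsi initiator
    iscsiadm -m node -T iqn.2010-03.example:node2-lun -p 192.168.1.2 --login

    # assemble the synchronous RAID1 of the local LV and the imported LUN
    # (/dev/sdX stands for whatever device the iSCSI session shows up as)
    mdadm --assemble /dev/md0 /dev/vg0/data /dev/sdX

    # mount, export over NFS and take over the service IP
    mount -t xfs /dev/md0 /export/data
    exportfs -a          # exports listed in /etc/exports, NFS server running
    ip addr add 192.168.1.10/24 dev eth0

    # --- teardown on the old master, in the opposite order ---
    ip addr del 192.168.1.10/24 dev eth0
    exportfs -ua
    umount /export/data
    mdadm --stop /dev/md0
    iscsiadm -m node -T iqn.2010-03.example:node2-lun -p 192.168.1.2 --logout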
>> You just need to do the same thing on the Solaris side. However,
>> remember that ZFS is not multi-initiator aware, so you cannot mount
>> the zpools on both nodes at once without disk corruption. You will
>> probably want to wrap the service, IP and storage in a zone and fail
>> that over all together instead of separately at the global zone level.
>>
>> Google is your friend. I'd suggest searching for "Solaris Cluster iSCSI
>> zone".
>
> I would, but that is not what we want to do at all...
>
> Trust me, I have just spent close to two weeks trying to get this to
> work. I have read all the Sun documentation that seemed relevant and
> everything Google found that seemed relevant, but I am hoping I have
> missed something obvious because I cannot see how to do it...
>
> Best regards,
>
>        Anton
>
>> fpsm
>>
>> On Thu, Mar 11, 2010 at 9:18 AM, Anton Altaparmakov <aia21 at cam.ac.uk>
>> wrote:
>>> Hi,
>>>
>>> I have been trying to set up Solaris Storage AVS with Sun Cluster in
>>> the hope of having a ZFS file system replicated synchronously (via
>>> TCP/IP only) between two machines, so that it is mounted read-write on
>>> one machine and, if that machine fails, is mounted read-write on the
>>> other machine.
>>>
>>> I have been reading all sorts of documentation and man pages and
>>> experimenting, but everything I have tried immediately asks for the
>>> configuration of shared storage, which we do not have, as the two
>>> machines are only connected by TCP/IP.
>>>
>>> We have such a system running at the moment using Linux, with iSCSI
>>> plus software RAID for the replication, XFS as the file system and
>>> heartbeat v2 for the failover, and that works well. We then have an
>>> NFS server which exports the XFS file system, and the NFS server is
>>> migrated between the two nodes in the heartbeat cluster together with
>>> the service IP address and the XFS file system. I have now spent ages
>>> trying to figure out what to do with Sun Cluster and AVS to achieve
>>> the same and I am completely failing to do it.  )-:
>>>
>>> Would someone, pretty please with sugar on top, point me at the
>>> documentation I am failing to find, or alternatively give me some
>>> pointers as to which commands I should be using?
>>>
>>> Thank you very much in advance!
>>>
>>> Best regards,
>>>
>>>        Anton
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer, http://www.linux-ntfs.org/
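P.S. For completeness: once AVS is replicating the volumes underneath the pool, the failover itself on the Solaris side comes down to something like this - again only an untested sketch with a made-up pool name, and assuming the replication set is already in place and in sync:

    # on the old primary, if it is still alive
    zpool export tank

    # on the surviving node: stop replication into the secondary volumes
    # (drop the set into logging mode), then import the pool from them.
    # -f is needed because the pool was last in use on the other host.
    sndradm -n -l
    zpool import -f tank

    # to fail back later, resynchronise in the reverse direction first
    # (e.g. sndradm -n -u -r) before exporting/importing again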