Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Hi Mike,

I've been working on some resource agents to configure LIO to use implicit ALUA in an Active/Standby config across 2 hosts. After a week-long crash course in pacemaker and LIO, I now have a very sore head, but it looks like it's working fairly well. I hope to be in a position in the next few days where I can share these scripts if there is interest.

It's based loosely on the thread that you linked below, where the TPGs are offset on each host so that the same ID is active on both nodes, but the ones actually bound to the IQN are different IDs on each node. This is then presented to ESX hosts via 4 iSCSI network portals (2 per host, to achieve a redundant fabric over 2 switches). According to ESX, the VAAI extensions are in use.

From your first email you seem to say that using ATS locking is OK in an active/standby config; can you just confirm this?

Hope that helps,
Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike Christie
Sent: 28 January 2015 03:06
To: Zoltan Arnold Nagy; Jake Young
Cc: Nick Fisk; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

Oh yeah, I am not completely sure (I have not tested this myself), but suppose you were doing a setup where you were not using a clustering app like Windows/Red Hat clustering that uses PRs, did not use VMFS, and were instead accessing the disks exported by LIO/tgt directly in the VM (either using the guest's iSCSI client or as a raw ESX device), and were not using ESX clustering. Then you might be safe doing active/passive or active/active with no modifications needed, other than some scripts to distribute the setup info across the LIO/tgt nodes. Were any of you trying this type of setup when you were describing your results? If so, were you running Oracle or something like that? Just wondering.

On 01/27/2015 08:58 PM, Mike Christie wrote:

I do not know about perf, but here is some info on what is safe, and general info.
- If you are not using VAAI, then ESX will use the older-style RESERVE/RELEASE commands only. If you are using VAAI ATS and doing active/active, then you need something like the lock/sync mechanism described in the slides/hammer doc to coordinate multiple ATS/COMPARE AND WRITE commands so they do not execute at the same time on the same sectors. You probably never see problems today, because ESX normally issues this command for only one sector, and I do not think there are normally multiple commands in flight for the same sectors. For active/passive, ATS is simple, since only the one LIO/tgt node executes commands at a time, so the locking is done locally with an ordinary mutex.

- tgt and LIO both support SCSI-3 persistent reservations. This is not really needed for ESX VMFS, though, since it uses ATS or the older RESERVE/RELEASE. If you were using a cluster app like Windows clustering, Red Hat clustering, etc. in ESX, or in normal non-VM use, then you need something extra to support SCSI-3 PRs, in both active/active and active/passive. For A/A, you need something like what is described in that doc/video. For A/P, you would need to copy the PR state over from one node to the other when failing over/back across nodes; for LIO this state lives in /var/target. Depending on how you do A/P (which ALUA states you use, if you use ALUA), you might also need to distribute the PR info continuously if you are doing Windows clustering: Windows wants to see a consistent view of the PR info from all ports if you use something like ALUA active/optimized and standby states for active/passive.

- I do not completely understand the comment about using LIO as a backend for tgt. You would use either tgt or LIO to export an rbd device, not both at the same time with LIO as some sort of tgt backend. Maybe people meant using the RBD backend instead of the LIO backend.

- There are some other setup complications if you are using ALUA; see http://comments.gmane.org/gmane.linux.scsi.target.devel/7044.
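Mike's ATS point can be made concrete with a toy model. COMPARE AND WRITE (the SCSI command behind VMware's ATS locking) is only atomic if every command for a given sector funnels through one lock; with two independent target nodes there are two locks and two caches, which is exactly the coordination gap. This is an illustrative sketch, not target code; all names are invented:

```python
import threading

SECTOR = 512

class AtsDevice:
    """Toy in-memory model of ONE target node handling
    COMPARE AND WRITE, guarded by a local per-device mutex."""

    def __init__(self, sectors=1024):
        self.data = bytearray(sectors * SECTOR)
        self.lock = threading.Lock()  # the "normal old mutex" Mike mentions

    def compare_and_write(self, lba, verify, write):
        """Atomically: if sector `lba` equals `verify`, replace it with
        `write` and return True; otherwise change nothing, return False."""
        assert len(verify) == SECTOR and len(write) == SECTOR
        off = lba * SECTOR
        with self.lock:
            if bytes(self.data[off:off + SECTOR]) != verify:
                return False  # miscompare: someone else owns the lock record
            self.data[off:off + SECTOR] = write
            return True

dev = AtsDevice()
free = bytes(SECTOR)                    # all-zero lock record means "free"
owned = b"host-A".ljust(SECTOR, b"\0")  # record claimed by ESXi host A

assert dev.compare_and_write(0, free, owned) is True   # host A wins the lock
assert dev.compare_and_write(0, free, owned) is False  # host B miscompares
```

In active/passive only one such node ever services commands, so the single `threading.Lock` analogue is sufficient; in active/active each node would have its own independent lock and data copy, and nothing prevents both from "winning".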
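The A/P failover step Mike describes (carrying LIO's persistent-reservation state from /var/target to the node taking over) could be scripted roughly as below. Only the /var/target location comes from his message; the directory names and the local-copy mechanism are assumptions for illustration, and a real two-node pair would rsync this to the peer instead:

```python
import shutil
from pathlib import Path

def sync_pr_state(src="/var/target", dest="/var/target.standby"):
    """Replace the standby's copy of LIO's persistent-reservation
    (APTPL) state with the active node's copy, so stale registrations
    do not survive the failover."""
    src, dest = Path(src), Path(dest)
    if dest.exists():
        shutil.rmtree(dest)       # drop the standby's stale state first
    shutil.copytree(src, dest)    # copies contents and file metadata

# A failover hook (e.g. a pacemaker pre-promote notification) would call
# something like:
#   sync_pr_state("/var/target", "<peer staging dir or mount>")
```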
I think tgt does not support ALUA, but LIO does.

On 01/23/2015 04:25 PM, Zoltan Arnold Nagy wrote:

[snip -- quoted text appears in full elsewhere in the thread]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
On 01/28/2015 02:10 AM, Nick Fisk wrote:

> Hi Mike,
>
> I've been working on some resource agents to configure LIO to use implicit ALUA in an Active/Standby config across 2 hosts. After a week-long crash course in pacemaker and LIO, I now have a very sore head, but it looks like it's working fairly well. I hope to be in a position in the next few days where I can share these scripts if there is interest.

Hey, yes, please share them when they are ready.

> It's based loosely on the thread that you linked below, where the TPGs are offset on each host so that the same ID is active on both nodes, but the ones actually bound to the IQN are different IDs on each node. This is then presented to ESX hosts via 4 iSCSI network portals (2 per host, to achieve a redundant fabric over 2 switches). According to ESX, the VAAI extensions are in use. From your first email you seem to say that using ATS locking is OK in an active/standby config; can you just confirm this?

Yes, VAAI's ATS-based locking is OK in active/standby with the out-of-box LIO, tgt and SCST targets. In this type of setup, ESX will only send ATS commands to the active node. LIO and SCST will then just use their local device mutex to make sure only one of the commands executes at a time on that node. I do not think tgt actually has locking in the normal/single-node case, so I am not sure how safe it is; we could say it is at least as safe as it is in single-node use.

> Hope that helps,
> Nick

[snip]
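Nick's implicit-ALUA active/standby pair could be wired into pacemaker along these lines. This is a hypothetical sketch only: the `ocf:custom:lio-alua` agent name and its parameters are placeholders, since Nick's actual resource agents are not shown in the thread. Because the portals stay up on both nodes and only the ALUA states change, there is no floating virtual IP; promotion/demotion flips the TPG states instead.

```
# crm configure fragment (illustrative; agent name is a placeholder)
primitive p_lio_alua ocf:custom:lio-alua \
    params iqn=iqn.2015-01.com.example:rbd-target \
    op monitor interval=20s role=Master \
    op monitor interval=30s role=Slave
ms ms_lio_alua p_lio_alua \
    meta master-max=1 master-node-max=1 clone-max=2 notify=true
# On promote the agent would set the local TPG's ALUA state to
# active/optimized; on demote, back to standby. ESX keeps seeing
# all four portals and simply follows the optimized path.
```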
[snip -- quoted text appears in full elsewhere in the thread]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy <zol...@linux.vnet.ibm.com> wrote:

> Just to chime in: it will look fine, feel fine, but underneath it's quite easy to get VMFS corruption. That happened in our tests. Also, if you're running LIO, expect a kernel panic from time to time (I haven't tried the latest upstream, as I've been using Ubuntu 14.04 on my export hosts for the tests, so it might have improved...). As of now I would not recommend this setup without being aware of the risks involved. There have been a few upstream patches getting the LIO code into better cluster-aware shape, but I have no idea whether they have been merged yet. I know Red Hat has a guy on this.

On 01/21/2015 02:40 PM, Nick Fisk wrote:

> Hi Jake,
>
> Thanks for this. I have been going through it and have a pretty good idea of what you are doing now. However, I may be missing something looking through your scripts, but I'm still not quite understanding how you are making sure locking happens with the ESXi ATS SCSI command.
>
> This slide deck (https://wiki.ceph.com/@api/deki/files/38/hammer-ceph-devel-summit-scsi-target-clustering.pdf, page 8) seems to indicate that for a true active/active setup the two targets need to be aware of each other and exchange locking information for it to work reliably. I've also watched the video from the Ceph developer summit where this is discussed, and it seems that Ceph and the kernel need changes to allow this locking to be pushed down to the RBD layer so it can be shared. From what I can see browsing the Linux git repo, these patches haven't made the mainline kernel yet.
>
> Can you shed any light on this? As tempting as having active/active is, I'm wary of using the configuration until I understand how the locking works, and whether fringe cases involving multiple ESXi hosts writing to the same LUN on different targets could spell disaster.
>
> Many thanks,
> Nick

From: Jake Young [mailto:jak3...@gmail.com]
Sent: 14 January 2015 16:54
To: Nick Fisk
Cc: Giuseppe Civitella; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

Yes, it's active/active, and I found that VMware can switch from path to path with no issues or service impact.

I posted some config files here: github.com/jak3kaj/misc

One set is from my LIO nodes, both the primary and secondary configs, so you can see what I needed to make unique. The other set (targets.conf) is from my tgt nodes. They are both 4-LUN configs. Like I said in my previous email, there is no performance difference between LIO and tgt. The only service I'm running on these nodes is a single iSCSI target instance (either LIO or tgt).

Jake

On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk <n...@fisk.me.uk> wrote:

> Hi Jake,
>
> I can't remember the exact details, but it was something to do with a potential problem when using the pacemaker resource agents. I think it was a potential hang when one LUN on a shared target failed and the agent then tried to kill all the other LUNs to fail the target over to another host. That leaves the TCM part of LIO holding a lock on the RBD, which then also can't fail over. That said, I did try multiple LUNs on one target as a test and didn't experience any problems.
>
> I'm interested in the way you have your setup configured, though. Are you saying you effectively have an active/active configuration with a path going to either host, or are you failing the iSCSI IP between hosts? If it's the former, have you had any problems with SCSI locking/reservations etc. between the two targets? I can see the advantage of that configuration, as you reduce/eliminate a lot of the trouble I have had with resources failing over.
>
> Nick
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
[snip -- quoted text appears in full elsewhere in the thread]

From: Jake Young [mailto:jak3...@gmail.com]
Sent: 14 January 2015 12:50
To: Nick Fisk
Cc: Giuseppe Civitella; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

Nick,

Where did you read that having more than 1 LUN per target causes stability problems? I am running 4 LUNs per target.

For HA I'm running two Linux iSCSI target servers that map the same 4 rbd images. The two targets have the same serial numbers, T10 address, etc. I copy the primary's config to the backup and change the IPs. This way VMware thinks they are different target IPs on the same host. This has worked very well for me.

One suggestion I have is to try using rbd-enabled tgt. The performance
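Jake's clone-and-edit step might look like this in outline. The dict layout below loosely mirrors an LIO saveconfig-style file, but the field names and values are illustrative, not the exact on-disk schema; the point is that only the portal IPs change, while the WWN/serial and LUN mapping stay identical so ESX treats the second node as extra paths to the same device rather than as a new device:

```python
import copy

# Illustrative stand-in for the primary node's target config.
primary = {
    "storage_objects": [
        {"name": "rbd0", "wwn": "f81d4fae-7dec-11d0-a765-00a0c91e6bf6"}
    ],
    "targets": [{
        "wwn": "iqn.2015-01.com.example:rbd-target",
        "tpgs": [{
            "tag": 1,
            "portals": [{"ip_address": "192.168.10.1", "port": 3260}],
            "luns": [{"index": 0, "storage_object": "rbd0"}],
        }],
    }],
}

def make_secondary(cfg, ip_map):
    """Clone the config, rewriting only the portal IPs.
    Everything identity-bearing (wwn, IQN, LUN indices) is untouched."""
    cfg = copy.deepcopy(cfg)
    for target in cfg["targets"]:
        for tpg in target["tpgs"]:
            for portal in tpg["portals"]:
                portal["ip_address"] = ip_map[portal["ip_address"]]
    return cfg

secondary = make_secondary(primary, {"192.168.10.1": "192.168.10.2"})
assert secondary["targets"][0]["wwn"] == primary["targets"][0]["wwn"]
assert secondary["targets"][0]["tpgs"][0]["portals"][0]["ip_address"] == "192.168.10.2"
```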
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Thanks for the feedback, Nick and Zoltan.

I have been seeing periodic kernel panics when I used LIO; it was due either to LIO or to the kernel rbd mapping. I have seen this on Ubuntu precise with kernel 3.14.14 and again on Ubuntu trusty with the utopic kernel (currently 3.16.0-28). Ironically, this is the primary reason I started exploring a redundancy solution for my iSCSI proxy node. So yes, these crashes have nothing to do with running the Active/Active setup. I am moving my entire setup from LIO to rbd-enabled tgt, which I've found to be much more stable and to give equivalent performance.

I've been testing active/active LIO since July of 2014 with VMware and I've never seen any VMFS corruption. I am now convinced (thanks, Nick) that it is possible. The reason I have not seen any corruption may have to do with how VMware happens to be configured. Originally I had made a point of using round-robin path selection in the VMware hosts, but as I did performance testing I found that it actually didn't help performance: when the host switches iSCSI targets there is a short spin-up time before LIO gets to 100% IO capability, and since round robin switches targets every 30 seconds (60 seconds? I forget), this seemed significant. A secondary goal for me was to end up with a config that required minimal tuning of VMware and the target software, so the obvious choice is to leave VMware's path selection at the default, which is Fixed and picks the first target in ASCII-betical order. That means I am actually functioning in Active/Passive mode.

Jake

On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy <zol...@linux.vnet.ibm.com> wrote:

[snip]
[snip -- quoted text appears in full elsewhere in the thread]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Thanks for your responses, guys. I've been spending a lot of time looking at this recently, and I think I'm even more confused than when I started.

I've been looking at trying to adapt a resource agent made by Tiger Computing (https://github.com/tigercomputing/ocf-lio) to create a HA LIO failover target. Instead of going with the virtual-IP failover method, it manipulates the ALUA states to present active/standby paths. It's very complicated, and I am close to giving up. What do you reckon: accept defeat and go with a much simpler tgt and virtual-IP failover solution for the time being, until the Red Hat patches make their way into the kernel?

From: Jake Young [mailto:jak3...@gmail.com]
Sent: 23 January 2015 16:46
To: Zoltan Arnold Nagy
Cc: Nick Fisk; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

[snip]
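For orientation, the core action of such an ALUA-manipulating agent is small: flip a target port group's access state through LIO's configfs interface. The sketch below is hypothetical Python rather than the agent's actual shell code; the configfs group path depends entirely on your backstore layout, and the numeric values follow the usual LIO convention (0 for Active/Optimized, 2 for Standby), so verify both against your own setup:

```python
from pathlib import Path

# Illustrative default; the real path depends on backstore/group names.
DEFAULT_GROUP = Path(
    "/sys/kernel/config/target/core/iblock_0/rbd0/alua/default_tg_pt_gp"
)

ACTIVE_OPTIMIZED = 0   # promoted node: initiators use this path
STANDBY = 2            # demoted node: path advertised but not used

def set_alua_state(state, group=DEFAULT_GROUP):
    """Write the numeric ALUA access state for one target port group,
    the way a promote/demote hook in a resource agent would."""
    (Path(group) / "alua_access_state").write_text(str(state))

# Promote on the node taking over:    set_alua_state(ACTIVE_OPTIMIZED)
# Demote on the node standing down:   set_alua_state(STANDBY)
```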
[snip -- quoted text appears in full elsewhere in the thread]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Correct me if I'm wrong, but tgt doesn't have full SCSI-3 persistence support when _not_ using the LIO backend for it, right? AFAIK you can either run tgt with its own iSCSI implementation or use tgt to manage your LIO targets. I assume that when you're running tgt with the rbd backend code you're skipping all the in-kernel LIO parts (in which case the Red Hat patches won't help a bit), and you won't have proper active/active support, since the initiators have no way to synchronize state (and, more importantly, no way to synchronize write caching! [I can think of some really ugly hacks to get around that, though...]).

On 01/23/2015 05:46 PM, Jake Young wrote:

[snip]
[snip -- quoted text appears in full elsewhere in the thread]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
I would go with tgt regardless of your HA solution. I tried to use LIO for a long time and am glad I finally seriously tested tgt. Two big reasons: 1) the latest rbd code will be in tgt, and 2) two fewer reasons for a kernel panic on the proxy node (rbd and iscsi).

For me, I'm comfortable with how my system is configured with the Active/Passive config, but only because of the network architecture and the fact that I administer the ESXi hosts. I also have separate rbd disks for each environment, so if I do get VMFS corruption, it is isolated to one system.

Another thing I forgot is that I disabled all the VAAI acceleration, based on this advice for tgt: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/039670.html

I was having poor performance with VAAI turned on with tgt; LIO performed the same with or without VAAI for my workload. I'm not sure whether that changes the way VMFS locking works enough to sidestep the corruption issue. I think I'm falling back to plain persistent SCSI reservations instead of ATS, so I may still be open to corruption for the same reason. See here if you haven't already for more details on VMFS locking: http://blogs.vmware.com/vsphere/2012/05/vmfs-locking-uncovered.html

Jake

On Friday, January 23, 2015, Nick Fisk <n...@fisk.me.uk> wrote:

Thanks for your responses, guys. I've been spending a lot of time looking at this recently and I think I'm even more confused than when I started. I've been looking at trying to adapt a resource agent made by Tiger Computing (https://github.com/tigercomputing/ocf-lio) to create an HA LIO failover target. Instead of going with the virtual-IP failover method, it manipulates the ALUA states to present active/standby paths. It's very complicated and I am close to giving up. What do you reckon: accept defeat and go with the much simpler tgt plus virtual-IP failover solution for the time being, until the Red Hat patches make their way into the kernel?
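For reference, disabling the VAAI primitives Jake mentions is normally done per ESXi host through the advanced settings; the option names below are the ones from the ESXi 5.x era and are worth double-checking against your version:

```shell
# Turn off the three VAAI block primitives on an ESXi 5.x host
# (XCOPY full copy, WRITE_SAME block zeroing, and ATS locking respectively)
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking

# Verify the current value
esxcli system settings advanced list --option /VMFS3/HardwareAcceleratedLocking
```

With HardwareAcceleratedLocking off, ESXi falls back to SCSI reservations for VMFS locking instead of ATS, which is the behaviour Jake describes.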
From: Jake Young [mailto:jak3...@gmail.com]
Sent: 23 January 2015 16:46
To: Zoltan Arnold Nagy
Cc: Nick Fisk; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?

Thanks for the feedback, Nick and Zoltan.

I have been seeing periodic kernel panics when I used LIO, due either to LIO or to the kernel rbd mapping. I have seen this on Ubuntu precise with kernel 3.14.14 and again on Ubuntu trusty with the utopic kernel (currently 3.16.0-28). Ironically, this is the primary reason I started exploring a redundancy solution for my iSCSI proxy node. So yes, these crashes have nothing to do with running the Active/Active setup. I am moving my entire setup from LIO to rbd-enabled tgt, which I've found to be much more stable and to give equivalent performance.

I've been testing active/active LIO since July of 2014 with VMWare and I've never seen any VMFS corruption. I am now convinced (thanks, Nick) that it is possible. The reason I have not seen any corruption may have to do with how VMWare happens to be configured. Originally, I had made a point of using round-robin path selection on the VMWare hosts; but as I did performance testing, I found that it actually didn't help. When the host switches iSCSI targets there is a short spin-up time before LIO reaches 100% IO capability, and since round robin switches targets every 30 seconds (60 seconds? I forget), this seemed to be significant.

A secondary goal for me was to end up with a config that required minimal tuning of VMWare and the target software, so the obvious choice was to leave VMWare's path selection at the default, which is Fixed and picks the first target in ASCII-betical order. That means I am actually functioning in Active/Passive mode.
Jake

On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy <zol...@linux.vnet.ibm.com> wrote: [...]
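Jake's Fixed-versus-round-robin observation maps to ESXi's path selection policies, which can be inspected and pinned per device from the ESXi shell; the naa device ID below is a placeholder:

```shell
# Show the paths and current path selection policy for one LUN
# (the naa device ID is a placeholder)
esxcli storage nmp device list --device naa.60014055a5a5a5a5a5a5a5a5a5a5a5a

# Pin the policy to Fixed, as Jake describes; VMW_PSP_RR would give round robin
esxcli storage nmp device set --device naa.60014055a5a5a5a5a5a5a5a5a5a5a5a --psp VMW_PSP_FIXED
```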
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Yes, it's active/active, and I found that VMWare can switch from path to path with no issues or service impact. I posted some config files here: https://github.com/jak3kaj/misc

One set is from my LIO nodes, both the primary and secondary configs, so you can see what I needed to make unique. The other set (targets.conf) is from my tgt nodes. They are both 4-LUN configs. Like I said in my previous email, there is no performance difference between LIO and tgt. The only service I'm running on these nodes is a single iSCSI target instance (either LIO or tgt).

Jake

On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk <n...@fisk.me.uk> wrote: [...]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Nick,

Where did you read that having more than 1 LUN per target causes stability problems? I am running 4 LUNs per target.

For HA I'm running two Linux iSCSI target servers that map the same 4 rbd images. The two targets have the same serial numbers, T10 address, etc. I copy the primary's config to the backup and change the IPs. This way VMWare thinks they are different target IPs on the same host. This has worked very well for me.

One suggestion I have is to try rbd-enabled tgt. The performance is equivalent to LIO, but I found it is much better at recovering from a cluster outage. I've had LIO lock up the kernel or simply not recognize that the rbd images are available, whereas tgt will eventually present the rbd images again.

I have been slowly adding servers and am expanding my test setup into a production setup (a nice thing about Ceph). I now have 6 OSD hosts with 7 disks each. I'm using the LSI Nytro cache RAID controller, so I don't have a separate journal, and I have 40Gb networking. I plan to add another 6 OSD hosts in another rack in the next 6 months (and then another 6 next year). I'm doing 3x replication, so I want to end up with 3 racks.

Jake

On Wednesday, January 14, 2015, Nick Fisk <n...@fisk.me.uk> wrote: [...]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
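One way Jake's "same serial numbers, T10 address" trick can be reproduced with LIO is through configfs, where each backstore exposes its unit serial; the backstore name and serial value below are placeholders:

```shell
# On the primary node, read the backstore's unit serial
# (backstore name "vmware-lun0" is a placeholder)
cat /sys/kernel/config/target/core/iblock_0/vmware-lun0/wwn/vpd_unit_serial

# On the backup node, write the same serial so both targets present
# identical VPD unit-serial data to the initiator
echo "250bdd85-1e5f-4b6a-9347-000000000001" > \
    /sys/kernel/config/target/core/iblock_0/vmware-lun0/wwn/vpd_unit_serial
```

Copying the whole saved LIO config to the backup node and changing only the portal IPs, as Jake does, achieves the same end without touching configfs by hand.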
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Hi Jake,

I can't remember the exact details, but it was something to do with a potential problem when using the pacemaker resource agents. I think it was a potential hang when one LUN on a shared target failed and pacemaker then tried to kill all the other LUNs to fail the target over to another host. This leaves the TCM part of LIO locking the RBD, which then also can't fail over. That said, I did try multiple LUNs on one target as a test and didn't experience any problems.

I'm interested in the way you have your setup configured, though. Are you saying you effectively have an active/active configuration with a path going to either host, or are you failing the iSCSI IP between hosts? If it's the former, have you had any problems with SCSI locking/reservations etc. between the two targets? I can see the advantage of that configuration, as you reduce/eliminate a lot of the trouble I have had with resources failing over.

Nick

From: Jake Young [mailto:jak3...@gmail.com]
Sent: 14 January 2015 12:50
To: Nick Fisk
Cc: Giuseppe Civitella; ceph-users
Subject: Re: [ceph-users] Ceph, LIO, VMWARE anyone?
[...]
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Giuseppe,

Despite the fact that I like SCST, I did a comparable setup with LIO (and the respective RBD LIO backend) in userspace. It spans at least three bridge nodes without any problems. In contrast to the usual (two-controller, one-backplane) iSCSI portals, I have to discover every single portal on its own, so the multipath part is a little more work, but not that challenging.

After a little more research, I'm going to give a talk at the Chemnitzer Linuxtage about this topic. Please leave me a PM if you want to peek into the (currently not completely ready) slides.

Kind regards,

Stephan Seitz
--
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
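On a Linux initiator, Stephan's per-portal discovery would look something like the following with open-iscsi and dm-multipath; the portal IPs are placeholders for the bridge nodes:

```shell
# Discover each bridge node's portal individually (IPs are placeholders)
iscsiadm -m discovery -t sendtargets -p 192.168.10.11:3260
iscsiadm -m discovery -t sendtargets -p 192.168.10.12:3260
iscsiadm -m discovery -t sendtargets -p 192.168.10.13:3260

# Log in to every discovered node record, then check the assembled paths
iscsiadm -m node --login
multipath -ll
```

If all portals export the same LUN with the same unit serial, dm-multipath coalesces them into one multipath device automatically.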
Re: [ceph-users] Ceph, LIO, VMWARE anyone?
Hi Giuseppe,

I am working on something very similar at the moment. I currently have it working on some test hardware and it seems to be working reasonably well. I say reasonably because I have had a few instabilities, but these are on the HA side; the LIO and RBD side of things has been rock solid so far. The main problems I have had are around recovering from failure, with resources ending up in an unmanaged state. I'm not currently using fencing, so this may be part of the cause.

As a brief description of my configuration: 4 hosts, each having 2 OSDs and also running the monitor role, plus 3 additional hosts in an HA cluster which act as iSCSI proxy nodes. I'm using the IP, RBD, iSCSITarget and iSCSILUN resource agents to provide an HA iSCSI LUN which maps back to an RBD. All the agents for each RBD are in a group so they follow each other between hosts. I'm using 1 LUN per target, as I read somewhere that there are stability problems with more than 1 LUN per target.

Performance seems OK; I can get about 1.2k random IOs out of the iSCSI LUN. This seems about right for the Ceph cluster size, so I don't think the LIO part is adding any significant overhead. We should be getting our production hardware shortly, which will have 40 OSDs with journals and an SSD caching tier, so within the next month or so I will have a better idea of running the system in production and of its performance.

Hope that helps; if you have any questions, please let me know.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Giuseppe Civitella
Sent: 13 January 2015 11:23
To: ceph-users
Subject: [ceph-users] Ceph, LIO, VMWARE anyone?

Hi all,
I'm working on a lab setup regarding Ceph serving rbd images as iSCSI datastores to VMWARE via a LIO box. Is there someone who already did something similar and wants to share some knowledge? Any production deployments? What about LIO's HA and LUN performance?

Thanks,
Giuseppe
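As a rough illustration of the group Nick describes, a crm shell configuration might look like the sketch below. This is only a sketch: the IPs, IQN, pool and image names are placeholders, and the agent names are assumptions (for instance, the stock resource-agents LUN agent is called iSCSILogicalUnit rather than iSCSILUN):

```
primitive p_rbd ocf:ceph:rbd \
    params user=admin pool=rbd name=vmware-lun0 cephconf=/etc/ceph/ceph.conf
primitive p_target ocf:heartbeat:iSCSITarget \
    params implementation=lio iqn=iqn.2015-01.com.example:vmware-lun0
primitive p_lun ocf:heartbeat:iSCSILogicalUnit \
    params target_iqn=iqn.2015-01.com.example:vmware-lun0 lun=0 \
        path=/dev/rbd/rbd/vmware-lun0
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.10.100 cidr_netmask=24
group g_lun0 p_rbd p_target p_lun p_ip
```

Because the four primitives are in one group, pacemaker starts them in order (map the RBD, create the target, export the LUN, bring up the portal IP) and moves them between proxy nodes as a unit, which is the per-RBD grouping Nick describes.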