Re: [ceph-users] rbd iscsi gateway question

2017-04-12 Thread Cédric Lemarchand
On Mon, 2017-04-10 at 12:13 -0500, Mike Christie wrote:
> 
> > LIO-TCMU+librbd-iscsi [1] [2] looks really promising and seems to be
> > the way to go. It would be great if somebody has insight into the
> > maturity of the project: is it ready for testing purposes?
> > 
> 
> It is not mature yet. You can do IO to an rbd image, but it is
> currently limited to a queue depth of 1.
> 
> We are in the process of merging patches from a couple of branches to
> add rbd aio support, failover/failback across gateways, perf
> improvements, and lots of bug fixes. With them, Linux works well, and
> we are working on a couple of Windows bugs.
> 
> For ESX, we are hoping to be ready around the end of summer. You
> should not use ESX with tcmu/tcmu-runner right now, because several
> commands are either not implemented or implemented incorrectly for ESX.

Thanks Mike, much appreciated. Any pointers/URLs to stay informed about
the progress?

Cheers,

Cédric

  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-10 Thread Mike Christie
On 04/10/2017 01:21 PM, Timofey Titovets wrote:
> JFYI: today we got a totally stable Ceph + ESXi setup "without hacks",
> and it passes stress tests.
> 
> 1. Don't try to pass RBD directly to LIO; that setup is unstable.
> 2. Instead, use QEMU + KVM (I use Proxmox to create the VM).
> 3. Attach the RBD to the VM as a VIRTIO-SCSI disk (it must be exported
> via target_core_iblock).

I think you avoid the hung-command problem because LIO uses the
local/initiator-side SCSI layer to send commands to the virtio-scsi
device, which has timeouts similar to ESX's. They will time out and
fire the virtio-scsi error handler, and commands will not just hang.
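
For illustration, the guest-side timeout that fires that error handler is the
per-device SCSI command timer; /dev/sdb below is only an assumed name for the
virtio-scsi disk inside the VM:

  cat /sys/block/sdb/device/timeout        # command timeout in seconds (30 by default)
  echo 30 > /sys/block/sdb/device/timeout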

I think you can now do something similar with Ilya's patch and use krbd
directly with target_core_iblock:

https://www.spinics.net/lists/ceph-devel/msg35618.html
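
As a rough sketch (the pool, image, and IQN names here are made up), exporting
a mapped krbd device through the iblock backstore with targetcli would look
something like:

  rbd map rbd/iscsi-lun0        # shows up as e.g. /dev/rbd0
  targetcli /backstores/block create name=iscsi-lun0 dev=/dev/rbd0
  targetcli /iscsi create iqn.2017-04.com.example:gw1
  targetcli /iscsi/iqn.2017-04.com.example:gw1/tpg1/luns create /backstores/block/iscsi-lun0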

> 4. Make a LIO target in the VM.
> 4.1 Sync the initiator (ESXi) and target (LIO) options (best to change
> the target options).
> 4.2 You can enable almost all VAAI features (also emulate_tpu=1,
> emulate_tpws=1).
> 4.3 For performance reasons, use the noop scheduler on the RBD disk in
> the VM and set is_nonrot=1 (disables the ESXi scheduler).
> 5. ESXi is "stupid" and has a problem with CAS on LIO (and with some
> other storage vendors; google for info), so for stable operation
> without LUN disconnects, set VMFS3.UseATSForHBOnVMFS5 to zero on all
> ESXi hosts that use this LUN.
> 6. Don't try to make the target HA (not tested, but I think you will
> hit problems with VMFS); you must do something like VM HA instead.
> 

Yes, the problem is for HA, where commands need to be cleaned up before
they are retried through different GWs/paths, so that an old command is
not racing with the retry or with new commands.

> This setup was tested with the latest ESXi and VMFS6.
> 
> Thanks.
> 



Re: [ceph-users] rbd iscsi gateway question

2017-04-10 Thread Timofey Titovets
JFYI: today we got a totally stable Ceph + ESXi setup "without hacks",
and it passes stress tests.

1. Don't try to pass RBD directly to LIO; that setup is unstable.
2. Instead, use QEMU + KVM (I use Proxmox to create the VM).
3. Attach the RBD to the VM as a VIRTIO-SCSI disk (it must be exported via target_core_iblock).
4. Make a LIO target in the VM.
4.1 Sync the initiator (ESXi) and target (LIO) options (best to change the target options).
4.2 You can enable almost all VAAI features (also emulate_tpu=1, emulate_tpws=1).
4.3 For performance reasons, use the noop scheduler on the RBD disk in the VM and
set is_nonrot=1 (disables the ESXi scheduler).
5. ESXi is "stupid" and has a problem with CAS on LIO (and with some
other storage vendors; google for info), so for stable operation
without LUN disconnects, set VMFS3.UseATSForHBOnVMFS5 to zero on all
ESXi hosts that use this LUN.
6. Don't try to make the target HA (not tested, but I think you will hit
problems with VMFS); you must do something like VM HA instead.

This setup was tested with the latest ESXi and VMFS6.
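
A rough sketch of steps 4.2, 4.3 and 5 (the disk and backstore names are
placeholders; check everything against your own setup):

  # Inside the VM, assuming the virtio-scsi disk shows up as /dev/sdb:
  echo noop > /sys/block/sdb/queue/scheduler     # pre-blk-mq kernels
  targetcli /backstores/block create name=esx-lun0 dev=/dev/sdb
  targetcli /backstores/block/esx-lun0 set attribute emulate_tpu=1 emulate_tpws=1 is_nonrot=1

  # On every ESXi host that mounts this LUN (per VMware's ATS-heartbeat guidance):
  esxcli system settings advanced set -o /VMFS3/UseATSForHBOnVMFS5 -i 0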

Thanks.


Re: [ceph-users] rbd iscsi gateway question

2017-04-10 Thread Mike Christie
On 04/06/2017 08:46 AM, David Disseldorp wrote:
> On Thu, 6 Apr 2017 14:27:01 +0100, Nick Fisk wrote:
> ...
>>> I'm not too sure what you're referring to WRT the spiral of death, but we did
>>> patch some LIO issues encountered when a command was aborted while
>>> outstanding at the LIO backstore layer.
>>> These specific fixes are carried in the mainline kernel, and can be tested
>>> using the AbortTaskSimpleAsync libiscsi test.  
>>
>> Awesome, glad this has finally been fixed. The death spiral I was referring
>> to is when using it with ESXi: both the initiator and target effectively hang
>> forever, and if you don't catch it soon enough you sometimes end up having
>> to kill all VMs and reboot hosts.
> 
> Sounds like it could be the same thing. Stale iSCSI sessions remain
> around which block subsequent login attempts.
> 
>> Do you know what kernel version these changes would have first gone into? I 
>> thought I looked back into this last summer and it was still showing the 
>> same behavior.
> 
> The fix I was referring to is:
> commit 5e2c956b8aa24d4f33ff7afef92d409eed164746
> Author: Nicholas Bellinger 
> Date:   Wed May 25 12:25:04 2016 -0700
> 
> target: Fix missing complete during ABORT_TASK + CMD_T_FABRIC_STOP
> 
> It's carried in v4.8+ and was also flagged for 3.14+ stable inclusion,
> so should be present in many distro kernels by now. That said, there
> have been many other changes in this area.
> 

I think we can still hit the issue with this patch. The general problem
is handling commands that are going to take longer than the initiator
side's error-handler timeouts. ESX will end up marking the VM/storage as
failed and the user has to intervene manually. It is similar to Linux,
where a /dev/sdX is marked offline and the user then has to manually
bring it back online and restart the layers above it.
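
For reference, the manual recovery on the Linux side looks roughly like this
(sdX is a placeholder):

  cat /sys/block/sdX/device/state            # reports "offline" once the error handler gives up
  echo running > /sys/block/sdX/device/state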

So we should root-cause why commands are taking so long. If it is
just a normal case, then to handle this issue in a more generic way for
all initiators, Nick suggested implementing a target-side timeout:

https://www.spinics.net/lists/target-devel/msg14780.html

In tcmu-runner we could then abort/kill the command based on a timer
there and then fail the command before the ESX timers fire. The
difficult part is of course aborting a running rbd command.

Note that you can currently set the tcmu timeout discussed in that
thread:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/target/target_core_user.c?id=7d7a743543905a8297dce53b36e793e5307da5d7

Setting it will avoid the problem, but there is no code in tcmu-runner
to stop the running command, so it would not be safe in some setups.
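
For example, on kernels that carry that commit the timeout is exposed through
the tcmu device's configfs attributes; the exact path and the device name
below are assumptions and may differ by kernel version:

  # "rbd.disk1" is a hypothetical tcmu-runner backed device.
  echo 30 > /sys/kernel/config/target/core/user_0/rbd.disk1/attrib/cmd_time_out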


Re: [ceph-users] rbd iscsi gateway question

2017-04-10 Thread Mike Christie
On 04/06/2017 03:22 AM, yipik...@gmail.com wrote:
> On 06/04/2017 09:42, Nick Fisk wrote:
>>
>> I assume Brady is referring to the death spiral LIO gets into with
>> some initiators, including vmware, if an IO takes longer than about
>> 10s. I haven’t heard of anything, and can’t see any changes, so I
>> would assume this issue still remains.
>>
>>  
>>
>> I would look at either SCST or NFS for now.
>>
> LIO-TCMU+librbd-iscsi [1] [2] looks really promising and seems to be the
> way to go. It would be great if somebody has insight into the maturity
> of the project: is it ready for testing purposes?
> 

It is not mature yet. You can do IO to an rbd image, but it is currently
limited to a queue depth of 1.

We are in the process of merging patches from a couple of branches to add
rbd aio support, failover/failback across gateways, perf improvements,
and lots of bug fixes. With them, Linux works well, and we are working
on a couple of Windows bugs.

For ESX, we are hoping to be ready around the end of summer. You should
not use ESX with tcmu/tcmu-runner right now, because several commands
are either not implemented or implemented incorrectly for ESX.


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread David Disseldorp
On Thu, 6 Apr 2017 14:27:01 +0100, Nick Fisk wrote:
...
> > I'm not too sure what you're referring to WRT the spiral of death, but we did
> > patch some LIO issues encountered when a command was aborted while
> > outstanding at the LIO backstore layer.
> > These specific fixes are carried in the mainline kernel, and can be tested
> > using the AbortTaskSimpleAsync libiscsi test.  
> 
> Awesome, glad this has finally been fixed. The death spiral I was referring to
> is when using it with ESXi: both the initiator and target effectively hang
> forever, and if you don't catch it soon enough you sometimes end up having to
> kill all VMs and reboot hosts.

Sounds like it could be the same thing. Stale iSCSI sessions remain
around which block subsequent login attempts.

> Do you know what kernel version these changes would have first gone into? I 
> thought I looked back into this last summer and it was still showing the same 
> behavior.

The fix I was referring to is:
commit 5e2c956b8aa24d4f33ff7afef92d409eed164746
Author: Nicholas Bellinger 
Date:   Wed May 25 12:25:04 2016 -0700

target: Fix missing complete during ABORT_TASK + CMD_T_FABRIC_STOP

It's carried in v4.8+ and was also flagged for 3.14+ stable inclusion,
so should be present in many distro kernels by now. That said, there
have been many other changes in this area.

Cheers, David


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Maged Mokhtar
We were in beta until early February, so we are relatively young. If there are
issues/bugs, we'd certainly be interested to hear about them through our forum.
Note that with us you can always use the CLI and bypass the UI; it will be
straight Ceph/LIO commands if you wish.



From: Brady Deetz 
Sent: Thursday, April 06, 2017 3:21 PM
To: ceph-users 
Subject: Re: [ceph-users] rbd iscsi gateway question


I appreciate everybody's responses here. I remember the announcement of PetaSAN
a while back on here, and some concerns about it.


Is anybody using it in production yet? 


On Apr 5, 2017 9:58 PM, "Brady Deetz" <bde...@gmail.com> wrote:

  I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging? 


  I'm attempting to determine if I have to move off of VMWare in order to 
safely use Ceph as my VM storage.







Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
> -Original Message-
> From: David Disseldorp [mailto:dd...@suse.de]
> Sent: 06 April 2017 14:06
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: 'Maged Mokhtar' <mmokh...@petasan.org>; 'Brady Deetz'
> <bde...@gmail.com>; 'ceph-users' <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] rbd iscsi gateway question
> 
> Hi,
> 
> On Thu, 6 Apr 2017 13:31:00 +0100, Nick Fisk wrote:
> 
> > > I believe there
> > > was a request to include it in the mainline kernel but it did not happen,
> > > probably waiting for the TCMU solution, which will be a better/cleaner design.
> 
> Indeed, we're proceeding with TCMU as a future upstream acceptable
> implementation.
> 
> > Yes, I should have mentioned this: if you are using the SUSE kernel,
> > they have a fix for this spiral-of-death problem.
> 
> I'm not too sure what you're referring to WRT the spiral of death, but we did
> patch some LIO issues encountered when a command was aborted while
> outstanding at the LIO backstore layer.
> These specific fixes are carried in the mainline kernel, and can be tested
> using the AbortTaskSimpleAsync libiscsi test.

Awesome, glad this has finally been fixed. The death spiral I was referring to
is when using it with ESXi: both the initiator and target effectively hang
forever, and if you don't catch it soon enough you sometimes end up having to
kill all VMs and reboot hosts.

Do you know what kernel version these changes would have first gone into? I 
thought I looked back into this last summer and it was still showing the same 
behavior.

> 
> Cheers, David



Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Brady Deetz
I appreciate everybody's responses here. I remember the announcement of
PetaSAN a while back on here, and some concerns about it.

Is anybody using it in production yet?

On Apr 5, 2017 9:58 PM, "Brady Deetz"  wrote:

> I apologize if this is a duplicate of something recent, but I'm not
> finding much. Does the issue still exist where dropping an OSD results in a
> LUN's I/O hanging?
>
> I'm attempting to determine if I have to move off of VMWare in order to
> safely use Ceph as my VM storage.
>


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread David Disseldorp
Hi,

On Thu, 6 Apr 2017 13:31:00 +0100, Nick Fisk wrote:

> > I believe there
> > was a request to include it in the mainline kernel but it did not happen,
> > probably waiting for the TCMU solution, which will be a better/cleaner design.

Indeed, we're proceeding with TCMU as a future upstream acceptable
implementation.

> Yes, I should have mentioned this: if you are using the SUSE kernel, they have
> a fix for this spiral-of-death problem.

I'm not too sure what you're referring to WRT the spiral of death, but we
did patch some LIO issues encountered when a command was aborted while
outstanding at the LIO backstore layer.
These specific fixes are carried in the mainline kernel, and can be
tested using the AbortTaskSimpleAsync libiscsi test.

Cheers, David


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Maged Mokhtar
> Sent: 06 April 2017 12:21
> To: Brady Deetz <bde...@gmail.com>; ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] rbd iscsi gateway question
> 
> The IO hang (it is actually a pause, not a hang) happens in Ceph only in case
> of a simultaneous failure of 2 hosts or of 2 OSDs on separate hosts. A single
> host/OSD being out will not cause this. In the PetaSAN project (www.petasan.org)
> we use LIO/krbd. We have done a lot of testing with VMware: in case of an IO
> failure, the IO will block for approximately 30s on the VMware ESX side (the
> default timeout, which can be configured) and then resume on the other MPIO path.
> 
> We are using a custom LIO/kernel taken from SLE 12, used in their enterprise
> storage offering; it supports a direct rbd backstore. I believe there was a
> request to include it in the mainline kernel but it did not happen, probably
> waiting for the TCMU solution, which will be a better/cleaner design.

Yes, I should have mentioned this: if you are using the SUSE kernel, they have
a fix for this spiral-of-death problem. Any other distribution or vanilla
kernel will hang if a Ceph IO takes longer than about 5-10s. It's the
path-failure part that is the problem: LIO tries to abort the IO, but RBD
doesn't support aborts yet.

> 
> Cheers /maged
> 


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Maged Mokhtar
The IO hang (it is actually a pause, not a hang) happens in Ceph only in case
of a simultaneous failure of 2 hosts or of 2 OSDs on separate hosts. A single
host/OSD being out will not cause this. In the PetaSAN project (www.petasan.org)
we use LIO/krbd. We have done a lot of testing with VMware: in case of an IO
failure, the IO will block for approximately 30s on the VMware ESX side (the
default timeout, which can be configured) and then resume on the other MPIO path.
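
One knob that can be tuned on the ESXi side is the software iSCSI adapter's
RecoveryTimeout; whether that is exactly the timer referred to above is an
assumption, and the adapter name is a placeholder:

  esxcli iscsi adapter param get -A vmhba33 | grep RecoveryTimeout
  esxcli iscsi adapter param set -A vmhba33 -k RecoveryTimeout -v 25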


We are using a custom LIO/kernel taken from SLE 12, used in their enterprise
storage offering; it supports a direct rbd backstore. I believe there was a
request to include it in the mainline kernel but it did not happen, probably
waiting for the TCMU solution, which will be a better/cleaner design.


Cheers /maged 




Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Oliver Humpage

> On 6 Apr 2017, at 08:42, Nick Fisk  wrote:
> 
> I assume Brady is referring to the death spiral LIO gets into with some 
> initiators, including vmware, if an IO takes longer than about 10s.

We have occasionally seen this issue with VMware+LIO, almost always when
upgrading OSD nodes. Didn't realise it was a known issue! Apart from that,
though, we've found LIO generally to be far more performant and stable
(especially in our multipathing setup), so we would like to stick with it if
possible.

I’m wondering, are there any additional steps we should be taking to minimise 
the risk of LIO timeouts during upgrades? At the moment, we set the cluster to 
“noout”, stop the node’s services, upgrade the packages and reboot. For 
instance, is there a way to drain connections from clients to a particular node 
before shutting down its OSDs?
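
For what it's worth, the procedure described above boils down to roughly the
following on a systemd-based install (a sketch, not a recipe):

  ceph osd set noout                 # stop CRUSH from rebalancing while the node is down
  systemctl stop ceph-osd.target     # on the node being upgraded
  # upgrade packages, reboot, wait for the OSDs to rejoin and PGs to go active+clean
  ceph osd unset noout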

Thanks,

Oliver.



Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread yipik...@gmail.com
On 06/04/2017 09:42, Nick Fisk wrote:
>
> I assume Brady is referring to the death spiral LIO gets into with
> some initiators, including vmware, if an IO takes longer than about
> 10s. I haven’t heard of anything, and can’t see any changes, so I
> would assume this issue still remains.
>
>  
>
> I would look at either SCST or NFS for now.
>
LIO-TCMU+librbd-iscsi [1] [2] looks really promising and seems to be the
way to go. It would be great if somebody has insight into the maturity
of the project: is it ready for testing purposes?

Cheers

Cédric

[1] https://ceph.com/planet/ceph-rbd-and-iscsi/
[2] https://github.com/open-iscsi/tcmu-runner
>
>  
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian Saul
> Sent: 06 April 2017 05:32
> To: Brady Deetz <bde...@gmail.com>; ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] rbd iscsi gateway question
>
>  
>
>  
>
> I am not sure if there is a hard and fast rule you are after, but
> pretty much anything that would cause ceph transactions to be blocked
> (flapping OSD, network loss, hung host) has the potential to block RBD
> IO which would cause your iSCSI LUNs to become unresponsive for that
> period.
>
>  
>
> For the most part though, once that condition clears things keep
> working, so it's not like a hang where you need to reboot to clear it.
> Some situations we have hit with our setup:
>
>  
>
>   * Failed OSDs (dead disks) – no issues
>   * Cluster rebalancing – ok if throttled back to keep service times down
>   * Network packet loss (bad fibre) – painful, broken communication
> everywhere, caused a krbd hang needing a reboot
>   * RBD Snapshot deletion – disk latency through roof, cluster
> unresponsive for minutes at a time, won’t do again.
>
>  
>
>  
>
>  
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady Deetz
> Sent: Thursday, 6 April 2017 12:58 PM
> To: ceph-users
> Subject: [ceph-users] rbd iscsi gateway question
>
>  
>
> I apologize if this is a duplicate of something recent, but I'm not
> finding much. Does the issue still exist where dropping an OSD results
> in a LUN's I/O hanging?
>
>  
>
> I'm attempting to determine if I have to move off of VMWare in order
> to safely use Ceph as my VM storage.
>
> Confidentiality: This email and any attachments are confidential and
> may be subject to copyright, legal or some other professional
> privilege. They are intended solely for the attention and use of the
> named addressee(s). They may only be copied, distributed or disclosed
> with the consent of the copyright owner. If you have received this
> email by mistake or by breach of the confidentiality clause, please
> notify the sender immediately by return email and delete or destroy
> all copies of the email. Any confidentiality, privilege or copyright
> is not waived or lost because this email has been sent to you by mistake.
>
>
>
>


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Adrian Saul
In my case I am using SCST, so that is what my experience is based on. For our
VMware we are using NFS, but for Hyper-V and Solaris we are using iSCSI.

There is actually some work being done on a userland SCST, which could be
interesting for a scst_librbd integration that bypasses the need for krbd.



From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, 6 April 2017 5:43 PM
To: Adrian Saul; 'Brady Deetz'; 'ceph-users'
Subject: RE: [ceph-users] rbd iscsi gateway question

I assume Brady is referring to the death spiral LIO gets into with some 
initiators, including vmware, if an IO takes longer than about 10s. I haven’t 
heard of anything, and can’t see any changes, so I would assume this issue 
still remains.

I would look at either SCST or NFS for now.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian 
Saul
Sent: 06 April 2017 05:32
To: Brady Deetz <bde...@gmail.com>; ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] rbd iscsi gateway question


I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
it's not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:

-  Failed OSDs (dead disks) – no issues
-  Cluster rebalancing – ok if throttled back to keep service times down
-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot
-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Nick Fisk
I assume Brady is referring to the death spiral LIO gets into with some 
initiators, including vmware, if an IO takes longer than about 10s. I haven’t 
heard of anything, and can’t see any changes, so I would assume this issue 
still remains.

 

I would look at either SCST or NFS for now.

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adrian 
Saul
Sent: 06 April 2017 05:32
To: Brady Deetz <bde...@gmail.com>; ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] rbd iscsi gateway question

 

 

I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

 

For the most part though, once that condition clears things keep working, so 
it's not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:

 

*   Failed OSDs (dead disks) – no issues
*   Cluster rebalancing – ok if throttled back to keep service times down
*   Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot
*   RBD Snapshot deletion – disk latency through roof, cluster unresponsive 
for minutes at a time, won’t do again.

 

 

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

 

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

 

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.



Re: [ceph-users] rbd iscsi gateway question

2017-04-05 Thread Adrian Saul

I am not sure if there is a hard and fast rule you are after, but pretty much 
anything that would cause ceph transactions to be blocked (flapping OSD, 
network loss, hung host) has the potential to block RBD IO which would cause 
your iSCSI LUNs to become unresponsive for that period.

For the most part though, once that condition clears things keep working, so 
it's not like a hang where you need to reboot to clear it.  Some situations we 
have hit with our setup:


-  Failed OSDs (dead disks) – no issues

-  Cluster rebalancing – ok if throttled back to keep service times down

-  Network packet loss (bad fibre) – painful, broken communication 
everywhere, caused a krbd hang needing a reboot

-  RBD Snapshot deletion – disk latency through roof, cluster 
unresponsive for minutes at a time, won’t do again.
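
For the "throttled back" rebalancing case above, a typical throttle on clusters
of that era looks something like this (the values are only examples):

  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'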



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Thursday, 6 April 2017 12:58 PM
To: ceph-users
Subject: [ceph-users] rbd iscsi gateway question

I apologize if this is a duplicate of something recent, but I'm not finding 
much. Does the issue still exist where dropping an OSD results in a LUN's I/O 
hanging?

I'm attempting to determine if I have to move off of VMWare in order to safely 
use Ceph as my VM storage.


[ceph-users] rbd iscsi gateway question

2017-04-05 Thread Brady Deetz
I apologize if this is a duplicate of something recent, but I'm not finding
much. Does the issue still exist where dropping an OSD results in a LUN's
I/O hanging?

I'm attempting to determine if I have to move off of VMware in order to
safely use Ceph as my VM storage.