Hi Mike,
> For the easy case, the SCSI command is sent directly to krbd and so if
> osd_request_timeout is less than M seconds then the command will be
> failed in time and we would not hit the problem above.
> If something happens in the target stack like the SCSI command gets
> stuck/queued then your osd_request_timeout value might be too short.
1) Currently the osd_request_timeout timer (req->r_start_stamp) is started
in osd_client.c. This is late in the stack and, as you mentioned, things
could be stuck earlier. Would it be better to start this timer earlier,
for example in iscsi_target.c iscsit_handle_scsi_cmd() at the start of
processing, and propagate this value down to osd_client?
Even more accurate would be to use SO_TIMESTAMPING and timestamp the
socket buffers as they are received, to compute the time of the current
stream position. We could also use TCP Timestamps (RFC 7323) sent from the
client initiator, which are enabled by default on Linux/Windows/ESX. But
this is more work. What are your thoughts?
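For reference, a quick way to check what the NIC/driver and the TCP stack
already provide (the interface name is just an example):

ethtool -T eth0                    # lists the rx/tx timestamping modes the driver supports
sysctl net.ipv4.tcp_timestamps     # 1 means TCP Timestamps (RFC 7323) are enabled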
2) I understand that before switching the path, the initiator will send a
TMF ABORT. Can we pass this down to the same abort_request() function in
osd_client that is used on osd_request_timeout expiry?
Cheers /Maged
On 2018-03-08 20:44, Mike Christie wrote:
> On 03/08/2018 10:59 AM, Lazuardi Nasution wrote:
>
>> Hi Mike,
>>
>> Since I have moved from LIO to TGT, I can do full ALUA (active/active)
>> of multiple gateways. Of course I have to disable any write back cache
>> at any level (RBD cache and TGT cache). It seems to be safe to disable
>> the exclusive lock since each RBD image is accessed only by a single
>> client, and as far as I know ALUA mostly uses RR across I/O paths.
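For anyone reproducing that setup, the image-side settings look roughly like
this (pool/image name is a placeholder; exclusive-lock can only be disabled
after its dependent features, if those were enabled):

rbd feature disable rbd/image1 fast-diff
rbd feature disable rbd/image1 object-map
rbd feature disable rbd/image1 exclusive-lock
# plus, in ceph.conf on the gateway nodes under [client]:  rbd cache = false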
>
> It might be possible if you have configured your timers correctly but I
> do not think anyone has figured it all out yet.
>
> Here is a simple but long example of the problem. Sorry for the length,
> but I want to make sure people know the risks.
>
> You have 2 iscsi target nodes and 1 iscsi initiator connected to both
> doing active/active over them.
>
> To make it really easy to hit, the iscsi initiator should be connected
> to the target with a different nic port or network than what is being
> used for ceph traffic.
>
> 1. Prep the data. Just clear the first sector of your iscsi disk. On the
> initiator system do:
>
> dd if=/dev/zero of=/dev/sdb count=1 oflag=direct
>
> 2. Kill the network/port for one of the iscsi targets' ceph traffic. So,
> for example, on target node 1 pull its cable for ceph traffic, if you set
> it up where iscsi and ceph use different physical ports. iSCSI traffic
> should be unaffected for this test.
>
> 3. Write some new data over the sector we just wrote in #1. This will
> get sent from the initiator to the target ok, but get stuck in the
> rbd/ceph layer since that network is down:
>
> dd if=somefile of=/dev/sdb count=1 oflag=direct iflag=direct
>
> 4. The initiator's EH timers will fire, the command will get failed and
> retried on the other path. After the dd in #3 completes, run:
>
> dd if=someotherfile of=/dev/sdb count=1 oflag=direct iflag=direct
>
> This should execute quickly since it goes through the good iscsi and
> ceph path right away.
>
> 5. Now plug the cable back in and wait for maybe 30 seconds for the
> network to come back up and the stuck command to run.
>
> 6. Now do
>
> dd if=/dev/sdb of=somenewfile count=1 iflag=direct oflag=direct
>
> The data is going to be the data sent in step 3 and not the new data in
> step 4.
>
> To get around this issue you could try to set the krbd
> osd_request_timeout to a value shorter than the initiator side failover
> timeout (for multipath-tools/open-iscsi on Linux this would be
> fast_io_fail_tmo/replacement_timeout) plus the various TMF/EH timers, but
> also account for the transport related timers that might short
> circuit/bypass the TMF based EH.
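For concreteness, these are the knobs being talked about; the target/image
names and all values below are only placeholders, not recommendations:

rbd map rbd/image1 -o osd_request_timeout=25     # krbd side, if your kernel supports this map option
iscsiadm -m node -T iqn.2003-01.org.example:target1 \
    -o update -n node.session.timeo.replacement_timeout -v 130
# multipath-tools: fast_io_fail_tmo is set in /etc/multipath.conf; pick values
# so the krbd timeout expires well before the initiator finishes failing over.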
>
> One problem with trying to rely on configuring that is handling all the
> corner cases. So you have:
>
> - Transport (nop) timer or SCSI/TMF command timer set so the
> fast_io_fail/replacement timer starts at N seconds and then fires at M.
> - It is a really bad connection so it takes N - 1 seconds to get the
> SCSI command from the initiator to target.
> - At the N second mark the iscsi connection is dropped and the
> fast_io_fail/replacement timer is started.
>
> For the easy case, the SCSI command is sent directly to krbd and so if
> osd_request_timeout is less than M seconds then the command will be
> failed in time and we would not hit the problem above.
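To make that arithmetic concrete with made-up numbers (a toy timeline only,
assuming the failover timer runs for M seconds after the drop at t=N):

N=30; M=45; OSD_REQUEST_TIMEOUT=25      # toy values
echo "command reaches krbd at t=$((N - 1))s, krbd fails it at t=$((N - 1 + OSD_REQUEST_TIMEOUT))s"
echo "initiator starts using the other path at about t=$((N + M))s"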
>
> If something happens in the target stack, like the SCSI command gets
> stuck/queued, then your osd_request_timeout value might be too short. For
> example, if you were using tgt/lio right now and this was a
> COMPARE_AND_WRITE, the READ part might take osd_request_timeout - 1
> seconds, and then the WRITE part might take osd_request_timeout - 1
> seconds, so you need to have your fast_io_fail long enough for that type
> of case. For tgt a WRITE_SAME command might be N WRITEs to krbd, so you
> need to make sure your queue depths are set so you do not end up with
> something similar to the CAW case, but where M WRITEs get executed and
> take osd_request_timeout - 1 seconds, then M more, etc., and at some point
> the iscsi connection is lost so the failover timer has started. Some ceph
> requests also might be multiple requests.
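Again with toy numbers only, the worst case for a command that fans out into
several krbd requests looks like:

OSD_REQUEST_TIMEOUT=25
SUB_REQUESTS=2            # e.g. a COMPARE_AND_WRITE split into a READ plus a WRITE
echo "up to ~$(( SUB_REQUESTS * (OSD_REQUEST_TIMEOUT - 1) ))s before the last piece is failed"

so the initiator-side failover time has to cover the sum, not just a single
osd_request_timeout.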
>
> Maybe an overly paranoid case, but one I still worry about because I do
> not want to mess up anyone's data, is that a disk on the iscsi target node
> goes flaky. In the target we do a kmalloc(GFP_KERNEL) to execute a SCSI
> command, and that blocks trying to write data to the flaky disk. If the
> disk recovers and we can eventually recover, did you account for the
> recovery timers in that code path when configuring the failover and krbd
> timers?
>
> One other case we have been debating about is: if krbd/librbd is able to
> put the ceph request on the wire but then the iscsi connection goes down,
> will the ceph request always get sent to the OSD before the initiator side
> failover timeouts have fired and it starts using a different target node?
>
>> Best regards,
>>
>> On Mar 8, 2018 11:54 PM, "Mike Christie" <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>>> Hi Christie,
>>> Is it safe to use active/passive multipath with krbd with
>> exclusive lock
>>> for lio/tgt/scst/tcmu?
>>
>> No. We tried to use lio and krbd initially, but there is an issue where
>> IO might get stuck in the target/block layer and get executed after new
>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>> add some code to tcmu's file_example handler which can be used with krbd
>> so it works like the rbd one.
>>
>> I do not know enough about SCST right now.
>>
>>> Is it safe to use active/active multipath If use suse kernel with
>>> target_core_rbd?
>>> Thanks.
>>>
>>> 2018-03-07
>>>
>> ------------------------------------------------------------------------
>>> shadowlin
>>>
>>>
>> ------------------------------------------------------------------------
>>>
>>> *From:* Mike Christie <[email protected]
>> <mailto:[email protected]>>
>>> *Sent:* 2018-03-07 03:51
>>> *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD
>>> Exclusive Lock
>>> *To:* "Lazuardi Nasution"<[email protected]
>> <mailto:[email protected]>>,"Ceph
>>> Users"<[email protected]
>> <mailto:[email protected]>>
>>> *Cc:*
>>>
>>> On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
>>>> Hi,
>>>>
>>>> I want to do load balanced multipathing (multiple iSCSI
>> gateway/exporter
>>>> nodes) of iSCSI backed with RBD images. Should I disable
>> exclusive lock
>>>> feature? What if I don't disable that feature? I'm using TGT
>> (manual
>>>> way) since I get so many CPU stuck error messages when I was
>> using LIO.
>>>>
>>>
>>> You are using LIO/TGT with krbd right?
>>>
>>> You cannot or shouldn't do active/active multipathing. If you
>> have the
>>> lock enabled then it bounces between paths for each IO and
>> will be slow.
>>> If you do not have it enabled then you can end up with stale IO
>>> overwriting current data.
>>>
>>>
>>>
>>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com