>>> Donald Williams <[email protected]> wrote on 15.02.2022 at 17:25 in
message
<cak3e-ezbjmdhkozgiz8lnmnaz+soca+qek0kpkqm4vq4pz8...@mail.gmail.com>:
> Hello,
>    Something else to check is your MPIO configuration. I have seen this
> same symptom when the Linux MPIO feature "queue_if_no_path" was enabled.
> 
>  From the /etc/multipath.conf file, showing it enabled:
> 
>     failback                immediate
>     features                "1 queue_if_no_path"

Yes, the actual config is interesting. Especially when using MD-RAID you 
typically do not want "1 queue_if_no_path"; but if the application cannot 
handle I/O errors, one might want it.
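
For example, instead of the features line one can state the intent explicitly
via no_path_retry (a sketch; on recent multipath-tools this setting overrides
a "queue_if_no_path" features entry):

        no_path_retry fail    # fail I/O once all paths are gone (MD-RAID case)
        # or "no_path_retry queue" for the queueing behaviour
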
For a FC SAN featuring ALUA we use:
...
        polling_interval 5
        max_polling_interval 20
        path_selector "service-time 0"
...
        path_checker "tur"
...
        fast_io_fail_tmo 5
        dev_loss_tmo 600

The logs are helpful, too. For example (there were some paths remaining all the 
time):
Cable was unplugged:
Feb 14 12:56:05 h16 kernel: qla2xxx [0000:41:00.0]-500b:3: LOOP DOWN detected (2 7 0 0).
Feb 14 12:56:10 h16 multipathd[5225]: sdbi: mark as failed
Feb 14 12:56:10 h16 multipathd[5225]: SAP_V11-PM: remaining active paths: 7
Feb 14 12:56:10 h16 kernel: sd 3:0:6:3: rejecting I/O to offline device
Feb 14 12:56:10 h16 kernel: sd 3:0:6:14: rejecting I/O to offline device
Feb 14 12:56:10 h16 kernel: sd 3:0:6:15: rejecting I/O to offline device

So 5 seconds later (matching fast_io_fail_tmo) the paths are offlined.

Cable was re-plugged:
Feb 14 12:56:22 h16 kernel: qla2xxx [0000:41:00.0]-500a:3: LOOP UP detected (8 Gbps).
Feb 14 12:56:22 h16 kernel: qla2xxx [0000:41:00.0]-11a2:3: FEC=enabled (data rate).
Feb 14 12:56:26 h16 multipathd[5225]: SAP_CJ1-PM: sdbc - tur checker reports path is up
Feb 14 12:56:26 h16 multipathd[5225]: 67:96: reinstated
Feb 14 12:56:26 h16 multipathd[5225]: SAP_CJ1-PM: remaining active paths: 5
Feb 14 12:56:26 h16 kernel: device-mapper: multipath: 254:4: Reinstating path 67:96.
Feb 14 12:56:26 h16 kernel: device-mapper: multipath: 254:6: Reinstating path 67:112.

So 4 seconds later (within one polling_interval of the tur checker) the
returned paths are reinstated.


Regards,
Ulrich



> 
>  Also, in the past some versions of linux multipathd would wait for a
> very long time before moving all I/O to the remaining path.
> 
>  Regards,
> Don
> 
> 
> On Tue, Feb 15, 2022 at 10:49 AM Zhengyuan Liu <[email protected]>
> wrote:
> 
>> Hi, all
>>
>> We have a production server that uses multipath + iscsi to attach storage
>> from a storage server. The server has two NICs; each carries about 20
>> iscsi sessions, and each session includes about 50 iscsi devices (yes,
>> that is about 2*20*50=2000 iscsi block devices in total on the server).
>> The problem is: once a NIC faults, it takes too long (nearly 80s) for
>> multipath to switch to the remaining good NIC link, because it first has
>> to block all iscsi devices behind the faulted NIC. The call chain is
>> shown below:
>>
>>     void iscsi_block_session(struct iscsi_cls_session *session)
>>     {
>>         queue_work(iscsi_eh_timer_workq, &session->block_work);
>>     }
>>
>>  __iscsi_block_session() -> scsi_target_block() -> target_block() ->
>>  device_block() -> scsi_internal_device_block() -> scsi_stop_queue() ->
>>  blk_mq_quiesce_queue() -> synchronize_rcu()
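>>
>> Roughly, the queued work item does the following (a simplified sketch of
>> drivers/scsi/scsi_transport_iscsi.c; details vary by kernel version):
>>
>>     /* simplified sketch, not the verbatim kernel code */
>>     static void __iscsi_block_session(struct work_struct *work)
>>     {
>>         struct iscsi_cls_session *session =
>>             container_of(work, struct iscsi_cls_session, block_work);
>>
>>         /* walks all ~50 devices of this target one by one; each
>>          * scsi_internal_device_block() ends in blk_mq_quiesce_queue(),
>>          * which waits in synchronize_rcu() (~80ms in our tracing) */
>>         scsi_target_block(&session->dev);
>>     }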
>>
>> This is processed sequentially for all sessions and all devices, and we
>> have traced that each synchronize_rcu() call takes about 80ms, so the
>> total cost is about 80s (80ms * 20 sessions * 50 devices = 1000 calls).
>> That is longer than the application can tolerate, so service may be
>> interrupted.
>>
>> So my question is: can we optimize the procedure to reduce the time spent
>> blocking all iscsi devices? I'm not sure whether it is a good idea to
>> increase the workqueue's max_active on iscsi_eh_timer_workq to improve
>> concurrency, as sketched below.
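>>
>> For illustration, the kind of change I have in mind (hypothetical; the
>> allocation site and flags differ across kernel versions, and whether the
>> error-handling paths are safe to run concurrently would need auditing):
>>
>>     /* hypothetical: allocate the EH workqueue with a higher max_active,
>>      * so that block_work items of independent sessions can run their
>>      * synchronize_rcu() waits in parallel instead of back to back */
>>     iscsi_eh_timer_workq = alloc_workqueue("iscsi_eh",
>>                                            WQ_SYSFS | WQ_MEM_RECLAIM,
>>                                            num_online_cpus());
>>
>> Even then, the ~50 devices within one session would still be blocked
>> sequentially (~4s), so perhaps the grace periods would need to be batched
>> or shared across devices as well.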
>>
>> Thanks in advance.
>>