--- Begin Message ---
Testing srp fail-over with dm-multipath/multipathd/srp_daemon, the current srp implementation takes 3-5 minutes on average to complete error recovery before returning DID_BAD_TARGET so that dm-multipath can switch to other paths. During this error recovery no I/O makes progress, neither outstanding nor newly submitted.

The following patches attempt to make srp fail-over faster and controllable. They introduce the srp_dev_loss_tmo module parameter, so that srp fails over once srp_dev_loss_tmo has expired. The minimum value for srp_dev_loss_tmo is 60 seconds.
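
To illustrate the 60-second floor, here is a minimal user-space sketch of how a requested timeout could be clamped. The helper name srp_clamp_dev_loss_tmo and the choice to clamp (rather than reject) a too-small value are assumptions for illustration only; the patches may handle this differently.

```c
/* Minimum stated in this message: 60 seconds. */
#define SRP_DEV_LOSS_TMO_MIN 60

/*
 * Hypothetical helper, not taken from the patches: return the timeout
 * (in seconds) that would actually be used for a requested value.
 * Assumption: values below the minimum are raised to the minimum.
 */
static int srp_clamp_dev_loss_tmo(int requested_secs)
{
    if (requested_secs < SRP_DEV_LOSS_TMO_MIN)
        return SRP_DEV_LOSS_TMO_MIN;
    return requested_secs;
}
```

With this sketch, a request of 10 seconds would yield 60, while a request of 120 seconds would be kept as is.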

Patch 1/4: recreate qp and cq at reconnect instead of reusing them.
Patch 2/4: send the disconnect request without waiting.
Patch 3/4: introduce srp_dev_loss_tmo, creating a timer on the qp_error event.
Patch 4/4: set up an async event handler to handle local port up/down events.
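
The timer behaviour described for patch 3/4 can be pictured with a small user-space model. This is only a sketch under assumptions: the sim_* names, the three states, and the explicit "now" argument are all hypothetical and do not come from the patches; real driver code would use kernel timers and the SCSI host result codes instead.

```c
#include <stdbool.h>

/* Hypothetical states for illustration; not the driver's own state names. */
enum sim_state { SIM_LIVE, SIM_RECOVERING, SIM_LOST };

struct sim_target {
    enum sim_state state;
    long deadline;      /* seconds; meaningful only while SIM_RECOVERING */
    long dev_loss_tmo;  /* seconds, models srp_dev_loss_tmo */
};

static void sim_init(struct sim_target *t, long dev_loss_tmo)
{
    t->state = SIM_LIVE;
    t->deadline = 0;
    t->dev_loss_tmo = dev_loss_tmo;
}

/* A qp error arms the timer, but only if one is not already running. */
static void sim_qp_error(struct sim_target *t, long now)
{
    if (t->state == SIM_LIVE) {
        t->state = SIM_RECOVERING;
        t->deadline = now + t->dev_loss_tmo;
    }
}

/* A successful reconnect before the deadline cancels the timer. */
static void sim_reconnected(struct sim_target *t)
{
    if (t->state == SIM_RECOVERING)
        t->state = SIM_LIVE;
}

/* Called periodically: once the deadline passes, the target is given up. */
static void sim_tick(struct sim_target *t, long now)
{
    if (t->state == SIM_RECOVERING && now >= t->deadline)
        t->state = SIM_LOST;
}

/* In the lost state I/O would be failed (e.g. with DID_BAD_TARGET) so
 * that dm-multipath can switch to another path. */
static bool sim_should_fail_io(const struct sim_target *t)
{
    return t->state == SIM_LOST;
}
```

In this model, a qp error at t=100 with a 60-second timeout leaves I/O pending until t=160, after which sim_should_fail_io() reports true; a reconnect at any point before t=160 returns the target to the live state.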

Fail-over is more accurate on local port up/down events (i.e. someone pulls the cable connecting the local port to the switch); it is less accurate on target port up/down events (i.e. someone pulls the cable connecting the target port to the switch).

To be accurate on target port up/down events, srp_daemon would need to be changed to catch IB target ports joining or leaving the fabric and pass these events down to the srp driver, and the srp driver would need to implement entry points to receive these events and act on them. These pieces are missing from this attempt.

thanks,
-vu


--- End Message ---
