Re: [ofa-general][PATCH 3/4] SRP fail-over faster

Vu Pham Wed, 14 Oct 2009 14:08:33 -0700

Roland Dreier wrote:

 > +static int srp_dev_loss_tmo = 60;


I don't think the name needs to be this abbreviated.  We don't
necessarily need the srp_ prefix, but probably "device_loss_timeout" is
much clearer without being too much longer.

OK

 > +
 > +module_param(srp_dev_loss_tmo, int, 0444);
 > +MODULE_PARM_DESC(srp_dev_loss_tmo,
 > +          "Default number of seconds that srp transport should \
 > +           insulate the lost of a remote port (default is 60 secs");

I can't understand this description.  What does "insulate the lost" of a
port mean?

I should change "remote port" to just "port". It means that the multipath driver won't know about a port offline event (pulling a cable, power-cycling a switch or target, ...) and won't act or fail over, because srp won't return an error code until this timeout has expired.

 > +static void srp_reconnect_work(struct work_struct *work)
 > +{
 > + struct srp_target_port *target =
 > +         container_of(work, struct srp_target_port, work);
 > +
 > + srp_reconnect_target(target);
 > + target->work_in_progress = 0;

surely this is racy... isn't it possible for a context to see
work_in_progress as 1, decide not to schedule the work, and then have it
set to 0 immediately afterwards by the workqueue context?

Yes, it is racy. The flag should be updated with the scsi host_lock held (lock_irq).
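
The race can be illustrated with a small userspace model. This is only a sketch under stated assumptions: a pthread mutex stands in for the scsi host_lock, and struct target_model, claim_reconnect_work() and finish_reconnect_work() are hypothetical names that do not appear in the patch. The point is that the test of work_in_progress and the write to it happen under the same lock, so no context can see a stale value between the check and the update.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the relevant part of struct srp_target_port. */
struct target_model {
	pthread_mutex_t lock;   /* models the scsi host_lock (assumption) */
	bool work_in_progress;
};

/*
 * Test and set the flag in one critical section.
 * Returns true if the caller claimed the work and should schedule it.
 */
static bool claim_reconnect_work(struct target_model *t)
{
	bool claimed = false;

	pthread_mutex_lock(&t->lock);
	if (!t->work_in_progress) {
		t->work_in_progress = true;
		claimed = true;
	}
	pthread_mutex_unlock(&t->lock);
	return claimed;
}

/* Called at the end of the work function, after the reconnect attempt. */
static void finish_reconnect_work(struct target_model *t)
{
	pthread_mutex_lock(&t->lock);
	assert(t->work_in_progress);  /* work must have been claimed first */
	t->work_in_progress = false;
	pthread_mutex_unlock(&t->lock);
}
```

In the driver itself the equivalent would presumably take the host lock with interrupts disabled rather than a mutex; the model only shows the ordering.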

 > +         target->qp_err_timer.expires = time * HZ + jiffies;

given that this is only with 1 second resolution, probably makes sense
to either make it a deferrable timer or round the timeout to avoid extra
wakeups.

OK - I'll round the timeout.
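
As a rough illustration of what rounding buys, here is a userspace model of the expiry computation. It is a sketch, not the driver code: round_up_to_second() and qp_err_expiry() are hypothetical helpers, the tick rate is passed in as a plain argument instead of using the kernel's HZ, and counter wraparound is ignored. The kernel provides round_jiffies() for this purpose; the model below always rounds up, which is a simplification chosen so that the timer never fires early.

```c
#include <assert.h>

/* Round a tick count up to the next whole-second boundary. */
static unsigned long round_up_to_second(unsigned long j, unsigned long hz)
{
	unsigned long rem;

	assert(hz > 0);
	rem = j % hz;
	return rem ? j + (hz - rem) : j;
}

/*
 * Model of "expires = time * HZ + jiffies", followed by rounding so that
 * timers with one-second resolution tend to expire on the same tick.
 */
static unsigned long qp_err_expiry(unsigned long now, unsigned long secs,
				   unsigned long hz)
{
	return round_up_to_second(secs * hz + now, hz);
}
```

With a tick rate of 250 and now = 1001, a 5 second timeout gives 2251 unrounded and 2500 after rounding, so it lines up with other timers that expire on whole seconds.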

 > +         add_timer(&target->qp_err_timer);

I don't see anywhere that this is canceled on module unload etc?

My mistake. Bart also pointed it out. I'll fix this.

 > +                         srp_qp_err_add_timer(target,
 > +                                              srp_dev_loss_tmo - 55);

 > + if (srp_dev_loss_tmo < 60)
 > +         srp_dev_loss_tmo = 60;

I don't understand the 55 and the 60 here... what are these magic
numbers?  Wouldn't it make sense for the user to specify the actual
timeout that is used (value - 55) rather than the value and then
secretly subtracting 55?

 - R.

First, it does not make sense for the user to set it below 60; therefore, it is forced to be 60 or above.

With the async event handler, srp can detect a local port going offline and set the timer to the exact device_loss_timeout; however, it has no mechanism to detect a remote port going offline (srp_daemon would need to register for traps and communicate remote ports going in/out of the fabric down to the srp driver).

I should just add a timer of X seconds instead of (device_loss_tmo - 55) when receiving a cqe error and/or a connection close event.
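
For reference, the arithmetic in the patch as posted works out as in the sketch below. The constants 60 and 55 are the ones from the quoted hunks; the macro and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define DEV_LOSS_MIN      60  /* floor applied to srp_dev_loss_tmo in the patch */
#define DEV_LOSS_SUBTRACT 55  /* amount subtracted before arming the timer */

/* Mirrors "if (srp_dev_loss_tmo < 60) srp_dev_loss_tmo = 60;". */
static int clamp_dev_loss_tmo(int requested)
{
	return requested < DEV_LOSS_MIN ? DEV_LOSS_MIN : requested;
}

/* Seconds actually passed to the qp error timer for a requested value. */
static int qp_err_timer_secs(int requested)
{
	int secs = clamp_dev_loss_tmo(requested) - DEV_LOSS_SUBTRACT;

	assert(secs >= DEV_LOSS_MIN - DEV_LOSS_SUBTRACT);  /* never below 5 */
	return secs;
}
```

So any setting of 60 or less arms a 5 second timer, and a setting of 90 arms a 35 second timer, which is the hidden offset being questioned.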


-vu

