Re: [ofa-general][PATCH 3/4] SRP fail-over faster

Vu Pham Thu, 29 Oct 2009 09:38:07 -0700

Roland Dreier wrote:

 > + if (timer_pending(&target->qp_err_timer))
 > +         del_timer_sync(&target->qp_err_timer);
 > +
 >   spin_unlock_irq(target->scsi_host->host_lock);


As was pointed out, I don't think you can do del_timer_sync while
holding the lock, since the timer function takes the same lock.

I don't think that it takes the same lock (ie. host_lock).; however, theproblem is kernel configured CONFIG_LOCKDEP, del_timer_sync() calllocal_irq_save/restore() to acquire/release timer->lockdep_map

But I don't know that just switching to del_timer without the sync works
here ... without the sync then the timeout function could still run any
time after the del_timer, even after everything gets freed.

I will move del_timer_sync() of the host_lock. It's safe because we onlysetup the same timer when the target state is ALIVE, however, targetstate is already REMOVED when del_timer_sync() is called

BTW the test of timer_pending isn't needed here... del_timer does the
test internally anyway.

I do agree it would be very good to improve the SRP error handling.  I
have some concerns about the overall design here -- it seems that if we
handle connection failure and allow a new connection to proceed while
cleaning up asynchronously, then this opens the door to a lot of
complexity, and I don't see that complexity handled in this patchset.
For example, the new connection could fail too before the old one is
done cleaning up, etc, etc and we end up with an arbitrarily large queue
of things waiting to clean up.  Or maybe it really it is simpler than that.

It's is complicated; however, it is not as complicated as you said.

There are two flows/places where we clean-up connection and create newone (ie. call srp_reconnect_target) in async event timer and scsi errorhandling srp_reset_host(); however, in srp_reset_host() the patch 3/4check for true condition of timer_pending or qp_in_error or targetstate wrapped in host_lock then it won't call srp_reconnect_target(). Itleft one flow/place to clean up / create new connection.

We destroy cm connection/cq/qp and create new connection/cq/qp in thesame flow/place; we also already clean up all scsi/srp resourcesassociated with old connection before create new one; therefore, theonly resources pended for clean up is cm/connection resources, qp/cqlow-level resources. And I believe the lower IB stack does that well.

I think the best way to move this forward would be to post another
cleaned up version of your patch set.  Please try to reorganize things
so each patch is reasonably self contained.  Of course your patchset is
taking multiple steps to improve things.  But as much as possible,
please try to avoid combining two things into a single patch, and
conversely also try to avoid putting things into a patch that don't make
sense without a later patch.

Avoiding policy in the kernel as much as possible in terms of hard-coded
timeouts etc is a good goal too.

Also it would help to give each patch a separate descriptive subject,
and put as much detail in the changelogs as you can.

Thanks,
  Roland

I'll follow the instructions above, rework the patch set and resubmit

thanks,
-vu
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ofa-general][PATCH 3/4] SRP fail-over faster

Reply via email to