Hi all,
I found two bugs in discovery that others are likely to hit.
I'd like to hear suggestions on the best short-term and long-term
fixes.
On a fabric with more than two switches, if the FC link from the
closest switch to the rest of the fabric flaps, the closest switch
sends an RSCN for each of the other switches' domains, in rapid
succession.
The first RSCN triggers a discovery, which sends a GPN_FT request.
Before we receive the response, the additional RSCNs arrive and set
disc->requested, so after the first discovery finishes, we start
another one.
The first bug is that when the GPN_FT response handler finishes parsing
the response, it sends the second GPN_FT, but then erroneously
increments its state variable disc->seq_count to indicate it's expecting
another frame, even though it's done. So, the second GPN_FT response is
received but treated as an error, since it is a first frame, not a
subsequent frame. That error leaves the disc->active flag set, so no
more discoveries get started, and the initiator doesn't see the targets
reappear after the link flap until libfc is unloaded and reloaded.
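To make the flag handling concrete, here is a toy model of how RSCNs
that arrive during a discovery collapse into one follow-up discovery.
This is not libfc code: struct rscn_model and the rscn_model_*()
functions are made up for illustration, and only the "active" and
"requested" fields mirror the disc flags mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not the libfc discovery structure. */
struct rscn_model {
	bool active;		/* a discovery pass is in flight */
	bool requested;		/* an RSCN arrived while active */
	unsigned int starts;	/* number of passes started so far */
};

/* Begin a discovery pass and clear any queued request. */
void rscn_model_start(struct rscn_model *m)
{
	assert(m != NULL);
	m->active = true;
	m->requested = false;
	m->starts++;
}

/* An RSCN was received: start now, or queue one re-discovery. */
void rscn_model_rscn(struct rscn_model *m)
{
	assert(m != NULL);
	if (m->active)
		m->requested = true;
	else
		rscn_model_start(m);
}

/* The current pass completed: run the queued pass, if any. */
void rscn_model_done(struct rscn_model *m)
{
	assert(m != NULL);
	m->active = false;
	if (m->requested)
		rscn_model_start(m);
}
```

In this model, three back-to-back RSCNs produce exactly two passes. It
also shows why the bug is fatal: if the done step is never reached for
the second pass, "active" stays set and every later RSCN is only queued.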
The first bug is easy to fix. Here's a patch (it won't apply as-is;
it's just for review):
diff --git a/drivers/scsi/libfc/fc_disc.c b/drivers/scsi/libfc/fc_disc.c
index 8aea860..f2c6a44 100644
--- a/drivers/scsi/libfc/fc_disc.c
+++ b/drivers/scsi/libfc/fc_disc.c
@@ -803,11 +803,10 @@ static void fc_disc_gpn_ft_resp(struct fc_seq *sp, struct fc_frame *fp,
 			    seq_cnt, disc->seq_count, fr_sof(fp), fr_eof(fp));
 	}
 	if (buf) {
+		disc->seq_count++;
 		error = fc_disc_gpn_ft_parse(disc, buf, len);
 		if (error)
 			fc_disc_error(disc, fp);
-		else
-			disc->seq_count++;
 	}
 	fc_frame_free(fp);
---
In the above code, the call to fc_disc_gpn_ft_parse() can restart discovery,
so incrementing seq_count after that interferes with the new discovery.
Incrementing it before that is harmless.
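The ordering problem can be reproduced in a few lines. The following is
only a sketch under assumptions: struct seq_model and the seq_model_*()
functions are hypothetical stand-ins, and I am assuming that restarting
discovery resets the expected frame count to zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; not the libfc discovery structure. */
struct seq_model {
	unsigned int seq_count;	/* index of the next frame expected */
	bool requested;		/* a re-discovery is queued */
};

/* Assumed restart behavior: the new pass expects frame 0 again. */
void seq_model_restart(struct seq_model *m)
{
	m->seq_count = 0;
	m->requested = false;
}

/*
 * Stand-in for parsing the last frame of a response. If a
 * re-discovery is queued, it restarts right away. Returns 0.
 */
int seq_model_parse_last(struct seq_model *m)
{
	if (m->requested)
		seq_model_restart(m);
	return 0;
}

/* Old ordering: increment after a successful parse. */
void seq_model_resp_old(struct seq_model *m)
{
	assert(m != NULL);
	if (seq_model_parse_last(m) == 0)
		m->seq_count++;	/* clobbers the restarted pass */
}

/* Patched ordering: increment first, then parse. */
void seq_model_resp_new(struct seq_model *m)
{
	assert(m != NULL);
	m->seq_count++;
	(void)seq_model_parse_last(m);
}
```

With a re-discovery queued, the old ordering leaves seq_count at 1, so
the first frame of the next response looks out of sequence. The patched
ordering leaves it at 0.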
This leads to the difficult part of the problem. The discovery is repeated
without logging off the remote ports we just discovered and without clearing
the rogue list, so we see all the remote ports twice.
There are a few short-term solutions:
1. Just ignore rports that are discovered but already in the rogue list.
2. Abort/LOGO all the outstanding PLOGIs from the first discovery and
   clear the rogue list before restarting discovery.
3. Don't start PLOGIs from discovery until it is complete. Instead, just
   build the rogue rport list as a list of what's been discovered.
   Maybe combine this with option 1. Then, when discovery is complete
   and there is no pending re-discovery, do the PLOGIs at that point.
   If a re-discovery is pending, free the previously discovered rogue
   list.
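As a rough sketch of option 3 combined with option 1: collect port IDs
during discovery, skip duplicates, and decide what to do with the list
only when the pass completes. Everything here is hypothetical (the
rogue_model names, the fixed-size array, and the port IDs in the usage
note); real code would use the existing rport structures and start a
PLOGI per entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROGUE_MODEL_MAX 16	/* arbitrary limit for the sketch */

/* Hypothetical stand-in for the rogue rport list. */
struct rogue_model {
	uint32_t port_id[ROGUE_MODEL_MAX];
	size_t count;
};

/*
 * Record a discovered port without logging in to it. A port that
 * is already listed is ignored (option 1). Returns true if added.
 */
bool rogue_model_add(struct rogue_model *l, uint32_t id)
{
	size_t i;

	assert(l != NULL);
	for (i = 0; i < l->count; i++)
		if (l->port_id[i] == id)
			return false;
	if (l->count == ROGUE_MODEL_MAX)
		return false;	/* list full; sketch only */
	l->port_id[l->count++] = id;
	return true;
}

/*
 * A discovery pass finished. If a re-discovery is pending, drop
 * the list; otherwise return how many PLOGIs would be started.
 * The list is empty afterwards in both cases.
 */
size_t rogue_model_complete(struct rogue_model *l, bool pending)
{
	size_t logins;

	assert(l != NULL);
	logins = pending ? 0 : l->count;
	l->count = 0;
	return logins;
}
```

For example, adding ports 0x010200, 0x010300 and then 0x010200 again
leaves two entries. Completing with a re-discovery pending starts no
PLOGIs and empties the list, so the next pass starts clean.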
Option 3 seems like a modest approach that is not too disruptive.
Longer term, something more sophisticated, using ADISC to reverify the
addresses, is needed.
Comments?
Thanks,
Joe