Hi,
I was lucky: I finally managed to reproduce the CONN_PEND stuck
state that had been so elusive. In fact, the PPP retry timer
coincides with LM_IDLE_TIMEOUT (once other overhead is counted in), so
IrNET was able to trigger it with absolute reliability ;-)
Here is what happens: the primary tries to connect a socket to
the remote IAS server (LSAP 0) just as LM_IDLE_TIMEOUT is about to
expire. Therefore, the LAP in the primary doesn't get disconnected.
Meanwhile, the secondary expires the LMP-LAP connection and
decides it's back in standby.
The first bug is that LMP-LAP changes state immediately, whereas
the real LAP takes time to shut down.
The second bug is that the LAP never processes the disconnect
request sent by LMP-LAP, because it never goes back to XMIT_S mode
(it has nothing to send).
As for our socket, it receives the connection indication and makes
a connection request. Because of the messy state LMP-LAP and LAP are
in, this request is lost, and the socket stays in CONN_PEND (bug #3).
Now, as this is the IAS server, no other connection request
will ever succeed. The box is deadlocked and all you can do is restart
the IrDA stack (not fun).
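
To make the sequence above concrete, here is a toy model in plain C. It is not kernel code: every toy_* name is made up for illustration, and the rule "the request only goes through when LMP-LAP and the real LAP agree" is a simplification of the messy state described above. It only sketches why the socket gets stuck and how keeping the LMP-LAP state, or a watchdog, gets it out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not the kernel's. */
enum toy_lsap_state { TOY_DISCONNECTED, TOY_CONNECT_PEND, TOY_CONNECTED };

struct toy_secondary {
	bool lmp_thinks_lap_up;    /* LMP-LAP view: active (true) or standby */
	bool lap_really_up;        /* the real LAP is still up */
	enum toy_lsap_state lsap;  /* the IAS server socket */
	int watchdog;              /* ticks left, 0 = not armed */
};

/* Idle timeout on the secondary. Without keep_state, LMP-LAP flips to
 * standby at once (bug #1) while the idle LAP never gets around to the
 * queued disconnect (bug #2), so the real LAP stays up. */
static void toy_idle_timeout(struct toy_secondary *s, bool keep_state)
{
	assert(s->lap_really_up);
	if (!keep_state)
		s->lmp_thinks_lap_up = false;
}

/* The connection indication from the primary reaches the IAS socket. */
static void toy_connect_indication(struct toy_secondary *s, bool use_watchdog)
{
	s->lsap = TOY_CONNECT_PEND;
	if (use_watchdog)
		s->watchdog = 1;

	/* Assumed rule: the LAP connect request only completes when both
	 * layers agree that the link is up. Otherwise it is lost. */
	if (s->lmp_thinks_lap_up && s->lap_really_up) {
		s->lsap = TOY_CONNECTED;
		s->watchdog = 0;
	}
}

/* One timer tick: an expired watchdog pulls a pending socket back. */
static void toy_tick(struct toy_secondary *s)
{
	if (s->watchdog > 0 && --s->watchdog == 0 &&
	    s->lsap == TOY_CONNECT_PEND)
		s->lsap = TOY_DISCONNECTED;
}

/* Run the whole scenario and return where the socket ends up. */
enum toy_lsap_state toy_run(bool keep_state, bool use_watchdog)
{
	struct toy_secondary s = {
		.lmp_thinks_lap_up = true,
		.lap_really_up = true,
		.lsap = TOY_DISCONNECTED,
		.watchdog = 0,
	};

	toy_idle_timeout(&s, keep_state);
	toy_connect_indication(&s, use_watchdog);
	toy_tick(&s);
	return s.lsap;
}
```

In this model, toy_run(false, false) leaves the socket in TOY_CONNECT_PEND (the deadlock), toy_run(false, true) brings it back to TOY_DISCONNECTED so that a later request can succeed, and toy_run(true, false) connects normally.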
If you want to reproduce this, the easiest way is to set a very
large LM_IDLE_TIMEOUT in the primary, so that you have time to launch
whatever app after the secondary times out and before the primary does.
I've done a little patch that fixes all three bugs for
2.4.5. The fix for bug #3 was already in one of my previous patches
(ir242_conn_pend_stuck_2.diff); I've just cleaned it up a little.
When your last patch is in Alan's tree, I'll
probably respin it (you need to remove ir242_conn_pend_stuck.diff
first, which is bogus).
Also, while you are doing cleanup, you may want to remove all
the refcount stuff, which is not used at all (yes, I know, I was the
one who introduced it).
That's it, have a nice weekend...
Jean
diff -u -p linux/net/irda/irlmp_event.d9.c linux/net/irda/irlmp_event.c
--- linux/net/irda/irlmp_event.d9.c Fri Jun 1 01:44:24 2001
+++ linux/net/irda/irlmp_event.c Fri Jun 1 23:41:51 2001
@@ -379,13 +379,23 @@ static void irlmp_state_active(struct la
irlmp_start_idle_timer(self, LM_IDLE_TIMEOUT);
else {
/* No more connections, so close IrLAP */
- irlmp_next_lap_state(self, LAP_STANDBY);
+
+ /* We don't want to change state just yet, because
+ * we want to reflect accurately the real state of
+ * the LAP, not the state we wish it was in,
+ * so that we don't lose LM_LAP_CONNECT_REQUEST.
+ * In some cases, IrLAP won't close the LAP
+ * immediately. For example, it might still be
+ * retrying packets or waiting for the pf bit.
+ * As the LAP always sends a DISCONNECT_INDICATION
+ * in PCLOSE or SCLOSE, just change state on that.
+ * Jean II */
irlap_disconnect_request(self->irlap);
}
break;
case LM_LAP_IDLE_TIMEOUT:
if (HASHBIN_GET_SIZE(self->lsaps) == 0) {
- irlmp_next_lap_state(self, LAP_STANDBY);
+ /* Same reasoning as above - keep state */
irlap_disconnect_request(self->irlap);
}
break;
@@ -472,8 +482,6 @@ static int irlmp_state_disconnected(stru
irlmp_start_watchdog_timer(self, 5*HZ);
break;
case LM_CONNECT_INDICATION:
- irlmp_next_lsap_state(self, LSAP_CONNECT_PEND);
-
if (self->conn_skb) {
WARNING(__FUNCTION__
"(), busy with another request!\n");
@@ -481,7 +489,20 @@ static int irlmp_state_disconnected(stru
}
self->conn_skb = skb;
+ irlmp_next_lsap_state(self, LSAP_CONNECT_PEND);
+
irlmp_do_lap_event(self->lap, LM_LAP_CONNECT_REQUEST, NULL);
+
+ /* Start watchdog timer
+ * This is not mentioned in the spec, but there is a rare
+ * race condition that can get the socket stuck.
+ * If we receive this event while our LAP is closing down,
+ * the LM_LAP_CONNECT_REQUEST gets lost and we get stuck in
+ * CONNECT_PEND state forever.
+ * Anyway, it makes sense to make sure that we always have
+ * a backup plan. 1 second is plenty (should be immediate).
+ * Jean II */
+ irlmp_start_watchdog_timer(self, 1*HZ);
break;
default:
IRDA_DEBUG(2, __FUNCTION__ "(), Unknown event %s\n",
@@ -533,6 +554,16 @@ static int irlmp_state_connect(struct ls
irlmp_next_lsap_state(self, LSAP_DATA_TRANSFER_READY);
break;
+ case LM_WATCHDOG_TIMEOUT:
+ /* May happen, who knows...
+ * Jean II */
+ IRDA_DEBUG(0, __FUNCTION__ "() WATCHDOG_TIMEOUT!\n");
+
+ /* Here, we should probably disconnect properly */
+ self->dlsap_sel = LSAP_ANY;
+ self->conn_skb = NULL;
+ irlmp_next_lsap_state(self, LSAP_DISCONNECTED);
+ break;
default:
IRDA_DEBUG(0, __FUNCTION__ "(), Unknown event %s\n",
irlmp_event[event]);
@@ -581,6 +612,17 @@ static int irlmp_state_connect_pend(stru
self->conn_skb = NULL;
irlmp_connect_indication(self, skb);
+ break;
+ case LM_WATCHDOG_TIMEOUT:
+ /* Will happen in some rare cases because of a race condition.
+ * Just make sure we don't stay there forever...
+ * Jean II */
+ IRDA_DEBUG(0, __FUNCTION__ "() WATCHDOG_TIMEOUT!\n");
+
+ /* Go back to disconnected mode, keep the socket waiting */
+ self->dlsap_sel = LSAP_ANY;
+ self->conn_skb = NULL;
+ irlmp_next_lsap_state(self, LSAP_DISCONNECTED);
break;
default:
IRDA_DEBUG(0, __FUNCTION__ "Unknown event %s\n",
diff -u -p linux/net/irda/irlap_event.d9.c linux/net/irda/irlap_event.c
--- linux/net/irda/irlap_event.d9.c Fri Jun 1 22:19:23 2001
+++ linux/net/irda/irlap_event.c Fri Jun 1 23:39:20 2001
@@ -1870,11 +1870,24 @@ static int irlap_state_nrm_s(struct irla
/* Update Nr received */
irlap_update_nr_received(self, info->nr);
irlap_wait_min_turn_around(self, &self->qos_tx);
+ irlap_start_wd_timer(self, self->wd_timeout);
- irlap_send_rr_frame(self, RSP_FRAME);
+ /* Note: if the link is idle (this case),
+ * we never go into XMIT_S, so we never get a
+ * chance to process any DISCONNECT_REQUEST.
+ * Do it now! - Jean II */
+ if (self->disconnect_pending) {
+ /* Disconnect */
+ irlap_send_rd_frame(self);
+ irlap_flush_all_queues(self);
+
+ irlap_next_state(self, LAP_SCLOSE);
+ } else {
+ /* Just send back pf bit */
+ irlap_send_rr_frame(self, RSP_FRAME);
- irlap_start_wd_timer(self, self->wd_timeout);
- irlap_next_state(self, LAP_NRM_S);
+ irlap_next_state(self, LAP_NRM_S);
+ }
}
} else if (nr_status == NR_UNEXPECTED) {
self->remote_busy = FALSE;