Dev, Vasu wrote:
>> -----Original Message-----
>> From: Mike Christie [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, August 19, 2008 3:24 AM
>>
>> Actually :)... forget that patch. Attached is a new one
>> fix-my-f-ups2.patch that fixes more possible bugs. I also stopped
>> abusing the ESB bits and defined internal state bits.
>>
> 
> The same bug of system crash after "WARNING: at net/ipv4/tcp_timer.c:290"
> occurred after applying fix-my-f-ups2.patch. I root caused this bug to
> passing up abort response to fcp while still freeing abort response

Ah nice.

> frame as per older abort code in fc_exch_abts_resp() caused
> double frame free for eth driver allocated skb buffer and that likely
> messed up  my networking stack. This explains ssh session hanged first
> and then system hanged after network skb pool messed up. Attached patch
> fixed that double frame free in fcp abort path and with this fix
> your 14 patches + fix-my-f-ups2.patch set works fine. I'm applying
> these patches to git tree.
> 
>> The bugs:
>>
>> 1. If fc_exch_rrq_resp is called because mgr reset is run, then we will
>> not drop the hold taken when the recov qual was found in the abts
>> response. The check for the qual in fc_exch_reset only handles if the
>> timer function has not sent the rrq yet. I added a ref to the aborted_ep
>> to the ep so we could make sure this gets cleaned up.
>>
> 
> Where did you add ref for aborted_ep ? Is it in addition to previously
> set exch ref along with setting ESB_ST_REC_QUAL ?

Oops. I did not mean ref as in hold. I just meant that I added the 
lookup pointer (thinking about Java and instead of pointer I said ref 
:)) to it.


> 
>> 2. If fc_exch_rrq's call to fc_exch_seq_send failed, then it looks like
>> we want to retry the rrq, so we need to reset the ESB_ST_REC_QUAL bit.
>>
> 
> I couldn't understand this bug. Currently fc_exch_rrq's set this bit if
> fc_exch_seq_send fails and then exch timer to retry rrq. The

I do not see that in the code right now. If fc_exch_seq_send fails in 
fc_exch_rrq we just set the timer and then exit.


> fc_exch_timeout() or fc_exch_reset() will reset this bit again.

fc_exch_timeout does this:
                 ep->esb_stat = e_stat & ~ESB_ST_REC_QUAL
                 spin_unlock_bh(&ep->ex_lock);
                 if (e_stat & ESB_ST_REC_QUAL)
                         fc_exch_rrq(ep);

So when we call fc_exch_rrq ESB_ST_REC_QUAL is not set on ep->esb_stat.

Then in fc_exch_rrq we do this:

         if (!fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, ep, lp->e_d_tov,
                               lp->fid, did, FC_FC_SEQ_INIT | 
FC_FC_END_SEQ))
                 fc_exch_timer_set(ep, ep->r_a_tov);


If fc_exch_seq_send fails and returns NULL, then we set the timer. I 
thought we set the timer here to make us retry sending the rrq. But 
becuase the ESB_ST_REC_QUAL bit is cleared before fc_exch_rrq we need to 
do it in fc_exch_rrq before we set the timer.


> 
> However I do see issues in resetting this bit before call to
> fc_exch_rrq and possible race with fc_exch_reset(), I guess this
> can be avoided using exch lock through out fc_exch_rrq and
> reset this bit when rrq is done.
> 
>> 3. If the timer function is sending the rrq, the ep for the rrq could
>> already be on the ex_list. If fc_exch_reset runs and sees it on the
>> list, then it would call fc_exch_done_locked (releasing the hold from
>> the timer functions alloc) on the ep while the timer function was still
>> accessing the ep. fc_exch_reset could then exit and we would release
>> fc_exch_mgr_reset's hold which would free the ep while the timer
>> function was still using it.
>>
> 
> The exch lock throughout fc_exch_rrq will help here.

Ah yeah nice. Because we add the new one to the back the list we would 
always wait on the aborted ep. Ok. I guess we do not need the 
del_timer_sync for this.
_______________________________________________
devel mailing list
[email protected]
http://www.open-fcoe.org/mailman/listinfo/devel

Reply via email to