On 2017-05-17 17:01:53 [+0200], To Chad Dupuis wrote:
> On 2017-05-12 11:55:52 [-0400], Chad Dupuis wrote:
> > Ok, I believe I've found the issue here. The machine that the test has
> > performed on had many more possible CPUs than active CPUs. We calculate
> > which CPU to the work time on in bnx2fc_process_new_cqes() like this:
> >
> > unsigned int cpu = wqe % num_possible_cpus();
> >
> > Since not all CPUs are active, we were trying to schedule work on
> > non-active CPUs which meant that the upper layers were never notified of
> > the completion. With this change:
> >
> > diff --git a/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > index c2288d6..6f08e43 100644
> > --- a/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > +++ b/drivers/scsi/bnx2fc/bnx2fc_hwi.c
> > @@ -1042,7 +1042,12 @@ static int bnx2fc_process_new_cqes(struct
> > bnx2fc_rport *tgt)
> > /* Pending work request completion */
> > struct bnx2fc_work *work = NULL;
> > struct bnx2fc_percpu_s *fps = NULL;
> > - unsigned int cpu = wqe % num_possible_cpus();
> > + unsigned int cpu = wqe % num_active_cpus();
> > +
> > + /* Sanity check cpu to make sure it's online */
> > + if (!cpu_active(cpu))
> > + /* Default to CPU 0 */
> > + cpu = 0;
> >
> > work = bnx2fc_alloc_work(tgt, wqe);
> > if (work) {
> >
> > The issue is fixed.
> >
> > Sebastian, can you add this change to your patch set?
>
> Are sure that you can reliably reproduce the issue and fix it with the
> patch above? Because this patch:
oh. Okay. Now it clicked. It can fix the issue but it is still possible,
that CPU0 goes down between your check for it and schedule_work_on()
returning. Let my think of something…
Sebastian