Hi Kashyap,

On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> Hi All -
> 
> We have seen CPU lockup issues in the field when a system has a high
> logical CPU count (more than 96).
> SAS3.0 controllers (Invader series) support at most 96 MSI-x vectors, and
> SAS3.5 controllers (Ventura) support at most 128 MSI-x vectors.
> 
> This may be a generic issue for any PCI device that supports completion on
> multiple reply queues. Let me explain it w.r.t. mpt3sas-supported h/w, just
> to simplify the problem and the possible changes to handle such issues. IT
> HBAs (mpt3sas) support multiple reply queues in the completion path. The
> driver creates MSI-x vectors for the controller as min(FW-supported reply
> queues, logical CPUs). If the submitter is not interrupted via a completion
> on the same CPU, there is a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeouts, system sluggishness, etc.

As I mentioned in another thread, this issue may be solved by SCSI_MQ by
mapping each reply queue onto a blk_mq hctx, together with
QUEUE_FLAG_SAME_FORCE. Especially since you already set
'smp_affinity_enable' to 1 by default, pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
can spread the IRQ vectors across CPUs perfectly for you.

But the following patch from Hannes is required for the conversion.

        https://marc.info/?l=linux-block&m=149130770004507&w=2

> 
> Example - one CPU (e.g. CPU A) is busy submitting IOs, and another CPU
> (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> from the reply descriptor queue upon receiving interrupts from the HBA. If
> CPU A is continuously pumping IOs, then CPU B (which is executing the ISR)
> will always see valid reply descriptors in the reply descriptor queue and
> will continuously process those reply descriptors in a loop without
> quitting the ISR handler. The mpt3sas driver exits the ISR handler only
> when it finds an unused reply descriptor in the reply descriptor queue.
> Since CPU A keeps sending IOs, CPU B may always see a valid reply
> descriptor (posted by HBA firmware after processing the IO) in the reply
> descriptor queue. In the worst case, the driver never quits this loop in
> the ISR handler, and eventually a CPU lockup is detected by the watchdog.
> 
> The above-mentioned behavior is not common if "rq_affinity" is set to 2,
> or if the affinity_hint is honored by irqbalance with the "exact" policy.
> If rq_affinity is set to 2, the submitter is always interrupted via a
> completion on the same CPU.
> If irqbalance uses the "exact" policy, the interrupt is delivered to the
> submitting CPU.

Now that you use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to allocate the
MSI-x vectors, the IRQ affinity can't be changed from userspace any more.

> 
> Problem statement -
> If the ratio of CPU count to MSI-x vector (reply descriptor queue) count
> is not 1:1, we are still exposed to the issue explained above, and for
> that we don't have any solution.
> 
> There is exposure to soft/hard lockups if the CPU count is more than the
> number of MSI-x vectors supported by the device.
> 
> If the CPU count to MSI-x vector count ratio is not 1:1 (in other words,
> if the ratio is X:1, where X > 1), then the 'exact' irqbalance policy or
> rq_affinity = 2 won't help avoid CPU hard/soft lockups. There is no longer
> a one-to-one mapping between CPU and MSI-x vector; instead, one MSI-x
> interrupt (or reply descriptor queue) is shared by a group/set of CPUs,
> and there is a possibility of a loop forming in the IO path within that
> CPU group, leading to lockups.
> 
> For example: consider a system having two NUMA nodes, each node having
> four logical CPUs, and the number of MSI-x vectors enabled on the HBA
> being two; then the CPU count to MSI-x vector count ratio is 4:1.
> e.g.
> MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0,
> and MSI-x vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
> node 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3        --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7        --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume that a user starts an application which uses all the CPUs of NUMA
> node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since this depends
> on irqbalance), say CPU 0, will receive the interrupts from MSI-x vector 0
> for all the IOs. Over time, CPU 0's IO submission percentage will decrease
> and its ISR processing percentage will increase, as it becomes busier
> processing interrupts. Gradually the IO submission percentage on CPU 0
> will drop to zero and its ISR processing percentage will reach 100
> percent, since an IO loop has formed within NUMA node 0: CPU 1, CPU 2 &
> CPU 3 are continuously busy submitting heavy IOs, while CPU 0 alone is
> busy in the ISR path, as it always finds a valid reply descriptor in the
> reply descriptor queue. Eventually, we will observe a hard lockup here.
> 
> The chances of hard/soft lockups occurring are directly proportional to
> the value of X: the higher X is, the higher the chance of observing CPU
> lockups.
> 
> Solution -
> Fix 1 - Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
> driver will execute the ISR routine in softirq context, and it will always
> quit the loop based on the budget provided via the IRQ poll interface.
> 
> In these scenarios (i.e. where the CPU count to MSI-x vector count ratio
> is X:1, where X > 1), the IRQ poll interface will avoid CPU hard lockups
> due to the voluntary exit from reply queue processing based on the budget.
> Note - only one MSI-x vector is busy doing processing. irqstat output -
> 
> IRQs / 1 second(s)
> IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
>   44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
> 
> Fix 2 - The above fix will avoid lockups, but there can be a performance
> issue if only a few reply queues are busy. The driver should round-robin
> the reply queues, so that each reply queue is load balanced. irqstat
> output after the driver load balances the reply queues -
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0  NODE1  NODE2  NODE3  NAME
>   44  62871  62871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45  62718  62718      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
> 
> In summary,
> a CPU completing IOs while not contributing to IO submission may cause a
> CPU lockup.
> If the CPU count to MSI-x vector count ratio is X:1 (where X > 1), then by
> using the IRQ poll interface we can avoid the CPU lockups, and by equally
> distributing the interrupts among the enabled MSI-x vectors we can avoid
> the performance issues.
> 
> We are planning to use both fixes only if the CPU count is more than the
> number of FW-supported MSI-x vectors.
> Please review and provide your feedback. I have appended both patches.
> 

Please take a look at pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) and
SCSI_MQ/blk_mq; your issue can be solved without much difficulty.

One annoying thing is that a SCSI driver has to support both the MQ and
non-MQ paths. A long time ago I submitted a patch to support forcing MQ in a
driver, but it was rejected.

Thanks,
Ming
