This series contains three performance improvements targeting the SCSI
and block layers on multi-socket NUMA and heavily loaded SMP systems.
On multi-socket NUMA systems we observed extreme I/O throughput variance
of 50-60% between runs. This series identifies and fixes two root causes:
cross-node memory accesses due to NUMA-unaware allocations in the scan
path, and false sharing between hot atomic counters in struct request_queue
and struct scsi_device.
Performance notes:
Tested on a dual-socket NUMA system (2x 32-core, 256 GB/socket) with
an mpi3mr HBA, running fio (random read, 4K, QD 64, 16 jobs, 60 s,
direct I/O). IOPS figures are in KIOPS (thousands of IOPS):
Configuration Avg KIOPS Range (KIOPS) Spread
Baseline 6,255 4,200 - 6,700 ~37%
Baseline + all patches 7,350 7,000 - 7,700 ~10%
Key findings:
These patches combinedly reduces the observed 50-60% run-to-run variance
to under 10%, significantly improving workload predictability and
improves IOPs by 16-18%.
No functional regressions observed.
Changes in v3
-------------
-Handled feedback from Bart Van Assche and John Garry.
-Added a patch for shost local NUMA allocation.
-Converted ioerr_cnt and iotmo_cnt atomic counters into per-cpu counters.
Changes in v2
--------------
Patch 1 — Same functional goal as v1 patch 1: NUMA-local scsi_device /
scsi_target allocations in the scan path so steady-state I/O does not
habitually touch remote memory when the host has a fixed DMA/NUMA
affinity.
Patch 2 — Replaces v1’s ____cacheline_aligned_in_smp on
nr_active_requests_shared_tags with removal of the shared-tag fairness
throttling machinery (including hctx_may_queue(), blk_mq_hw_ctx.nr_active,
and request_queue.nr_active_requests_shared_tags and their updates).
This follows the earlier standalone proposal by Bart Van Assche [1],
rebased for the current tree; it removes the high-frequency atomic
accounting that motivated the v1 false-sharing workaround and, in our
testing, improves IOPS on the order of roughly 16–18% for the shared-tag
workload exercised.
Patch 3 — Replaces v1’s cache-line padding of iodone_cnt with
percpu_counter for both iorequest_cnt and iodone_cnt, so submission and
completion paths mostly update CPU-local state instead of bouncing a
single cache line, without inflating struct scsi_device for SMP
alignment.
Merge / review hints
--------------------
Patch 3 touches the block layer and should have block maintainer review;
rest of patches are SCSI-oriented. Please route or Ack as your subsystem
workflow requires.
Bart Van Assche (1):
block: drop shared-tag fairness throttling
James Rizzo (1):
scsi: scan: allocate sdev and starget on the NUMA node of the host
adapter
Sumit Saxena (2):
scsi: host: allocate struct Scsi_Host on the NUMA node of the host
adapter
scsi: use percpu counters for iostat counters in struct scsi_device
block/blk-core.c | 2 -
block/blk-mq-debugfs.c | 22 ++++-
block/blk-mq-tag.c | 4 -
block/blk-mq.c | 17 +---
block/blk-mq.h | 100 ----------------------
drivers/scsi/3w-9xxx.c | 2 +-
drivers/scsi/3w-sas.c | 2 +-
drivers/scsi/3w-xxxx.c | 2 +-
drivers/scsi/53c700.c | 2 +-
drivers/scsi/BusLogic.c | 2 +-
drivers/scsi/a100u2w.c | 2 +-
drivers/scsi/a2091.c | 2 +-
drivers/scsi/a3000.c | 2 +-
drivers/scsi/aacraid/linit.c | 2 +-
drivers/scsi/advansys.c | 6 +-
drivers/scsi/aha152x.c | 2 +-
drivers/scsi/aha1542.c | 2 +-
drivers/scsi/aha1740.c | 2 +-
drivers/scsi/aic7xxx/aic79xx_osm.c | 2 +-
drivers/scsi/aic7xxx/aic7xxx_osm.c | 2 +-
drivers/scsi/aic94xx/aic94xx_init.c | 2 +-
drivers/scsi/am53c974.c | 2 +-
drivers/scsi/arcmsr/arcmsr_hba.c | 3 +-
drivers/scsi/arm/acornscsi.c | 2 +-
drivers/scsi/arm/arxescsi.c | 2 +-
drivers/scsi/arm/cumana_1.c | 2 +-
drivers/scsi/arm/cumana_2.c | 2 +-
drivers/scsi/arm/eesox.c | 2 +-
drivers/scsi/arm/oak.c | 2 +-
drivers/scsi/arm/powertec.c | 2 +-
drivers/scsi/atari_scsi.c | 2 +-
drivers/scsi/atp870u.c | 2 +-
drivers/scsi/bfa/bfad_im.c | 2 +-
drivers/scsi/csiostor/csio_init.c | 4 +-
drivers/scsi/dc395x.c | 2 +-
drivers/scsi/dmx3191d.c | 2 +-
drivers/scsi/elx/efct/efct_xport.c | 4 +-
drivers/scsi/esas2r/esas2r_main.c | 2 +-
drivers/scsi/fdomain.c | 2 +-
drivers/scsi/fnic/fnic_main.c | 2 +-
drivers/scsi/g_NCR5380.c | 2 +-
drivers/scsi/gvp11.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_main.c | 2 +-
drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 2 +-
drivers/scsi/hosts.c | 6 +-
drivers/scsi/hpsa.c | 2 +-
drivers/scsi/hptiop.c | 2 +-
drivers/scsi/ibmvscsi/ibmvfc.c | 2 +-
drivers/scsi/ibmvscsi/ibmvscsi.c | 2 +-
drivers/scsi/imm.c | 2 +-
drivers/scsi/initio.c | 2 +-
drivers/scsi/ipr.c | 2 +-
drivers/scsi/ips.c | 2 +-
drivers/scsi/isci/init.c | 2 +-
drivers/scsi/jazz_esp.c | 2 +-
drivers/scsi/libiscsi.c | 2 +-
drivers/scsi/lpfc/lpfc_init.c | 2 +-
drivers/scsi/mac53c94.c | 2 +-
drivers/scsi/mac_esp.c | 2 +-
drivers/scsi/mac_scsi.c | 2 +-
drivers/scsi/megaraid.c | 2 +-
drivers/scsi/megaraid/megaraid_mbox.c | 2 +-
drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
drivers/scsi/mesh.c | 2 +-
drivers/scsi/mpi3mr/mpi3mr_os.c | 2 +-
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 4 +-
drivers/scsi/mvme147.c | 2 +-
drivers/scsi/mvsas/mv_init.c | 2 +-
drivers/scsi/mvumi.c | 2 +-
drivers/scsi/myrb.c | 2 +-
drivers/scsi/myrs.c | 2 +-
drivers/scsi/ncr53c8xx.c | 2 +-
drivers/scsi/nsp32.c | 2 +-
drivers/scsi/pcmcia/nsp_cs.c | 2 +-
drivers/scsi/pcmcia/qlogic_stub.c | 2 +-
drivers/scsi/pcmcia/sym53c500_cs.c | 2 +-
drivers/scsi/pm8001/pm8001_init.c | 2 +-
drivers/scsi/pmcraid.c | 2 +-
drivers/scsi/ppa.c | 2 +-
drivers/scsi/ps3rom.c | 2 +-
drivers/scsi/qla1280.c | 2 +-
drivers/scsi/qla2xxx/qla_mid.c | 2 +-
drivers/scsi/qla2xxx/qla_os.c | 2 +-
drivers/scsi/qlogicfas.c | 2 +-
drivers/scsi/qlogicpti.c | 2 +-
drivers/scsi/scsi_debug.c | 2 +-
drivers/scsi/scsi_error.c | 4 +-
drivers/scsi/scsi_lib.c | 10 +--
drivers/scsi/scsi_scan.c | 15 +++-
drivers/scsi/scsi_sysfs.c | 23 +++--
drivers/scsi/sd.c | 2 +-
drivers/scsi/sgiwd93.c | 2 +-
drivers/scsi/smartpqi/smartpqi_init.c | 2 +-
drivers/scsi/snic/snic_main.c | 2 +-
drivers/scsi/stex.c | 2 +-
drivers/scsi/storvsc_drv.c | 2 +-
drivers/scsi/sun3_scsi.c | 2 +-
drivers/scsi/sun3x_esp.c | 2 +-
drivers/scsi/sun_esp.c | 2 +-
drivers/scsi/sym53c8xx_2/sym_glue.c | 2 +-
drivers/scsi/virtio_scsi.c | 2 +-
drivers/scsi/vmw_pvscsi.c | 2 +-
drivers/scsi/wd719x.c | 2 +-
drivers/scsi/xen-scsifront.c | 2 +-
drivers/scsi/zorro_esp.c | 2 +-
include/linux/blk-mq.h | 6 --
include/linux/blkdev.h | 2 -
include/scsi/libfc.h | 2 +-
include/scsi/scsi_device.h | 9 +-
include/scsi/scsi_host.h | 3 +-
110 files changed, 168 insertions(+), 258 deletions(-)
--
2.43.7