Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On 04/08/2016 12:06 PM, Keith Busch wrote:
> On Fri, Apr 08, 2016 at 01:40:06PM -0400, Matthew Wilcox wrote:
> > - Inability to use all queues supported by a device. Intel's P3700
> >   supports 31 queues, but block-mq insists on assigning an even
> >   multiple of CPUs to each queue. So if you have 48 CPUs, it will
> >   use 24 queues. If you have 128 CPUs, it will only use 16 of the
> >   queues.
>
> While it'd be better to use all the available h/w resources, that's
> actually not the worst part. The real problems occur when there are
> more physical/unique CPUs than h/w queues since blk-mq does not
> consider CPU topology beyond thread siblings. With 128 CPUs, blk-mq
> may use all 31 queues P3700 supports, but many CPU groups won't share
> a last-level-cache.
>
> Smarter assignment would reclaim some untapped performance, and we
> can share such code prior to the session.

There's definitely room for improvement in the cpu mapping code.
However, on the original complaint, it's by design (or, working as
intended) - this was done to keep the layout symmetrical. It's been
discussed on the mailing lists before. We can have a discussion whether
we should change this or not, of course.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
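The 48-CPU and 128-CPU figures quoted above can be reproduced with a small model of the "symmetrical layout" rule. The sketch below assumes the rule amounts to "use the largest queue count that divides the CPU count evenly and does not exceed what the device offers"; the function name is made up for illustration and this is not the kernel's actual mapping code.

```python
def symmetric_queue_count(nr_cpus: int, hw_queues: int) -> int:
    """Largest queue count <= hw_queues that divides nr_cpus evenly.

    Simplified model of the symmetric layout discussed in the thread;
    an assumption for illustration, not the blk-mq implementation.
    """
    if nr_cpus < 1 or hw_queues < 1:
        raise ValueError("nr_cpus and hw_queues must be positive")
    # Walk down from the most queues we could use until one divides evenly.
    for queues in range(min(nr_cpus, hw_queues), 0, -1):
        if nr_cpus % queues == 0:
            return queues
    return 1


for cpus in (48, 128):
    used = symmetric_queue_count(cpus, 31)
    print(f"{cpus} CPUs, 31 h/w queues -> {used} queues, "
          f"{cpus // used} CPUs per queue")
```

Under this model a 31-queue device is only fully used when the CPU count is a multiple of 31, which is the trade-off the symmetric layout makes.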
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
>Hey Willy,
>
>> - Interrupt steering needs to be controlled by block-mq instead of
>>   the driver. It's pointless to have each driver implement its own
>>   policies on interrupt steering, irqbalanced remains a source of
>>   end-user frustration, and block-mq can change the queue<->cpu mapping
>>   without the driver's knowledge.
>
>I honestly don't think that block-mq is the right place to
>*assign* interrupt steering. Not all HW devices are dedicated
>to storage, take RDMA for example, a RNIC is shared by block
>storage, networking and even user-space workloads so obviously
>block-mq can't understand how a user wants to steer interrupts.
>
>I think that block-mq needs to ask the device driver:
>"what is the optimal queue index for cpu X?" and use it
>while *someone* will be responsible for optimum interrupt
>steering (can be the driver itself or user-space).

+0.5 on block-mq asking the lower layer where to place the queue.
However, I think it is better for the lower layer to push the data up
rather than have block-mq ask for it: the user can change, or irqbalance
can relocate, the interrupt vector(s) at runtime.

A QLogic adapter can act in both Initiator and Target modes at the same
time. Certain target vendors might not want the initiator side holding
this knob.

>
>From some discussions I had with HCH I think he intends to
>use the cpu reverse-mapping API to try and do what's described
>above (if I'm not mistaken).
>___
>Lsf mailing list
>l...@lists.linux-foundation.org
>https://lists.linuxfoundation.org/mailman/listinfo/lsf
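The "push" variant suggested in this reply can be pictured as a map that the lower layer updates whenever a vector is relocated. This is a toy sketch only: `QueueMap` and `affinity_changed` are hypothetical names, and nothing here corresponds to an existing kernel interface.

```python
class QueueMap:
    """CPU -> queue map that the lower layer updates when vectors move."""

    def __init__(self, nr_cpus: int):
        # Until the driver says otherwise, every CPU submits to queue 0.
        self.cpu_to_queue = {cpu: 0 for cpu in range(nr_cpus)}

    def affinity_changed(self, queue: int, cpus) -> None:
        """Called by the (hypothetical) driver after a vector is relocated."""
        for cpu in cpus:
            if cpu not in self.cpu_to_queue:
                raise ValueError(f"unknown cpu {cpu}")
            self.cpu_to_queue[cpu] = queue


qmap = QueueMap(4)
qmap.affinity_changed(0, [0, 1])     # initial placement reported by the driver
qmap.affinity_changed(1, [2, 3])
qmap.affinity_changed(1, [1, 2, 3])  # e.g. irqbalance moved queue 1's vector
print(qmap.cpu_to_queue)
```

The point of the design is that the upper layer never has to poll: whoever moves the vector is responsible for reporting the new placement.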
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
Hey Willy,

> - Interrupt steering needs to be controlled by block-mq instead of
>   the driver. It's pointless to have each driver implement its own
>   policies on interrupt steering, irqbalanced remains a source of
>   end-user frustration, and block-mq can change the queue<->cpu mapping
>   without the driver's knowledge.

I honestly don't think that block-mq is the right place to
*assign* interrupt steering. Not all HW devices are dedicated
to storage, take RDMA for example, a RNIC is shared by block
storage, networking and even user-space workloads so obviously
block-mq can't understand how a user wants to steer interrupts.

I think that block-mq needs to ask the device driver:
"what is the optimal queue index for cpu X?" and use it
while *someone* will be responsible for optimum interrupt
steering (can be the driver itself or user-space).

From some discussions I had with HCH I think he intends to
use the cpu reverse-mapping API to try and do what's described
above (if I'm not mistaken).
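The "what is the optimal queue index for cpu X?" question above can be modelled as a callback that the upper layer invokes once per CPU. The sketch below is only an illustration of that shape: `build_queue_map`, `rnic_queue_for_cpu` and the `VECTOR_CPUS` table are invented for the example and do not correspond to any existing kernel or driver API.

```python
from typing import Callable, Dict, Iterable


def build_queue_map(cpus: Iterable[int], nr_queues: int,
                    queue_for_cpu: Callable[[int], int]) -> Dict[int, int]:
    """Ask the lower layer for each CPU's preferred queue and record it."""
    mapping = {}
    for cpu in cpus:
        queue = queue_for_cpu(cpu)
        if not 0 <= queue < nr_queues:
            raise ValueError(f"driver returned queue {queue} for cpu {cpu}")
        mapping[cpu] = queue
    return mapping


# Hypothetical RNIC: completion vector i is already steered to these CPUs
# by whoever owns interrupt policy (the driver itself or user space).
VECTOR_CPUS = {0: {0, 1}, 1: {2, 3}}


def rnic_queue_for_cpu(cpu: int) -> int:
    """Hypothetical driver callback: pick the queue whose vector is local."""
    for queue, cpus in VECTOR_CPUS.items():
        if cpu in cpus:
            return queue
    return 0  # fall back to queue 0 for CPUs no vector is steered to


queue_map = build_queue_map(range(4), len(VECTOR_CPUS), rnic_queue_for_cpu)
print(queue_map)
```

The upper layer only consumes the answer; where the interrupts actually land stays a decision of the driver or of user space, which is the separation argued for above.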
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, 2016-04-08 at 13:40 -0400, Matthew Wilcox wrote:
> On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
>
> - Interrupt steering needs to be controlled by block-mq instead of
>   the driver. It's pointless to have each driver implement its own
>   policies on interrupt steering, irqbalanced remains a source of
>   end-user frustration, and block-mq can change the queue<->cpu
>   mapping without the driver's knowledge.

This is the same problem in the networking space as well. When I added
affinity_hint to the irq_desc, and then that support into irqbalance, my
original approach was to allow the driver to assign affinities. This
was shot down because a driver was influencing policy, versus allowing
userspace to do so. Meh.

If there's something actionable out of this discussion that makes
interrupt steering better, I'd like to see us drive it into the
networking world as well. That would also let me rip out the
affinity_hint stuff overall from irqbalance...

-PJ

--
PJ Waskiewicz
Principal Engineer, NetApp
e: pj.waskiew...@netapp.com
d: 503.961.3705
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, 2016-04-08 at 20:08 +0200, Christoph Hellwig wrote:
> On Fri, Apr 08, 2016 at 11:00:51AM -0700, James Bottomley wrote:
> > > - Inability to use all queues supported by a device. Intel's
> > >   P3700 supports 31 queues, but block-mq insists on assigning an
> > >   even multiple of CPUs to each queue. So if you have 48 CPUs, it
> > >   will use 24 queues. If you have 128 CPUs, it will only use 16
> > >   of the queues.
> > >
> > > - Interrupt steering needs to be controlled by block-mq instead
> > >   of the driver. It's pointless to have each driver implement its
> > >   own policies on interrupt steering, irqbalanced remains a source
> > >   of end-user frustration, and block-mq can change the queue<->cpu
> > >   mapping without the driver's knowledge.
> > >
> > > (thanks to Keith for his input on the first and suggestion of the
> > > second).
> >
> > OK, what about two sessions, one for general bitching (the feedback
> > sessions) and one for concrete proposals for improvements ... so
> > rather than just complaining about the problem, if you have concrete
> > ideas about fixing it, that would go into the second session.
>
> We already have the blk-mq interrupt assignment session on the
> schedule, which is about willy's item. And my work in progress code
> to address the issue also mostly addresses his item number 1, so I
> think we can just keep the schedule mostly as is and just rename
> "multiqueue interrupt assignment" into "multiqueue interrupt and
> queue assignment".
>
> No need to blow it up into three slots.

Agreed; I made the adjustments.

James
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On 04/08/2016 10:40 AM, Matthew Wilcox wrote:
> - Interrupt steering needs to be controlled by block-mq instead of
>   the driver. It's pointless to have each driver implement its own
>   policies on interrupt steering, irqbalanced remains a source of
>   end-user frustration, and block-mq can change the queue<->cpu mapping
>   without the driver's knowledge.

I'm looking forward to the day that I will be able to drop my script for
spreading interrupts manually (see also the fifth attachment of
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409).

Bart.
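For readers who have not had to spread interrupts by hand, the general idea can be sketched as a round-robin plan over the device's IRQ numbers. This is not the script referenced above, whose contents are not shown in this thread; the IRQ numbers are hypothetical, and the only real interface assumed is the `/proc/irq/<irq>/smp_affinity_list` file, which needs root to write and which irqbalance may later overwrite.

```python
def spread_irqs(irqs, cpus):
    """Assign each IRQ to one CPU, round-robin over the given CPU list."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(sorted(irqs))}


def affinity_writes(plan):
    """Turn a plan into (path, value) pairs without touching the system."""
    return [(f"/proc/irq/{irq}/smp_affinity_list", str(cpu))
            for irq, cpu in sorted(plan.items())]


def apply_plan(plan):
    """Write the plan to procfs; requires root on a Linux host."""
    for path, value in affinity_writes(plan):
        with open(path, "w") as f:
            f.write(value)


# Hypothetical IRQ numbers for one adapter's vectors, spread over 4 CPUs.
plan = spread_irqs([120, 121, 122, 123, 124, 125], range(4))
for path, value in affinity_writes(plan):
    print(f"echo {value} > {path}")
```

Keeping the plan separate from `apply_plan` allows the result to be inspected as a dry run before anything is written.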
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, Apr 08, 2016 at 11:00:51AM -0700, James Bottomley wrote:
> > - Inability to use all queues supported by a device. Intel's P3700
> >   supports 31 queues, but block-mq insists on assigning an even
> >   multiple of CPUs to each queue. So if you have 48 CPUs, it will
> >   use 24 queues. If you have 128 CPUs, it will only use 16 of the
> >   queues.
> >
> > - Interrupt steering needs to be controlled by block-mq instead of
> >   the driver. It's pointless to have each driver implement its own
> >   policies on interrupt steering, irqbalanced remains a source of
> >   end-user frustration, and block-mq can change the queue<->cpu
> >   mapping without the driver's knowledge.
> >
> > (thanks to Keith for his input on the first and suggestion of the
> > second).
>
> OK, what about two sessions, one for general bitching (the feedback
> sessions) and one for concrete proposals for improvements ... so rather
> than just complaining about the problem, if you have concrete ideas
> about fixing it, that would go into the second session.

We already have the blk-mq interrupt assignment session on the schedule,
which is about willy's item. And my work in progress code to address
the issue also mostly addresses his item number 1, so I think we can
just keep the schedule mostly as is and just rename "multiqueue
interrupt assignment" into "multiqueue interrupt and queue assignment".

No need to blow it up into three slots.
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, Apr 08, 2016 at 01:40:06PM -0400, Matthew Wilcox wrote:
> - Inability to use all queues supported by a device. Intel's P3700
>   supports 31 queues, but block-mq insists on assigning an even
>   multiple of CPUs to each queue. So if you have 48 CPUs, it will use
>   24 queues. If you have 128 CPUs, it will only use 16 of the queues.

While it'd be better to use all the available h/w resources, that's
actually not the worst part. The real problems occur when there are
more physical/unique CPUs than h/w queues since blk-mq does not
consider CPU topology beyond thread siblings. With 128 CPUs, blk-mq may
use all 31 queues P3700 supports, but many CPU groups won't share a
last-level-cache.

Smarter assignment would reclaim some untapped performance, and we can
share such code prior to the session.
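One way to picture a "smarter assignment" is to hand out queues per last-level-cache domain, so that CPUs sharing a queue also share a cache. The sketch below is an illustration under stated assumptions, not the code offered in the message above: the function name and the `cpu_to_llc` input are hypothetical, and the topology (8 cache domains of 16 CPUs) is invented for the example.

```python
from collections import defaultdict


def map_cpus_by_llc(cpu_to_llc, hw_queues):
    """Return {cpu: queue} such that no queue spans two LLC domains."""
    domains = defaultdict(list)
    for cpu in sorted(cpu_to_llc):
        domains[cpu_to_llc[cpu]].append(cpu)
    groups = [domains[key] for key in sorted(domains)]
    if hw_queues < len(groups):
        raise ValueError("fewer queues than LLC domains; "
                         "sharing across caches is unavoidable")

    # One queue per domain first, then give each spare queue to the
    # domain that currently has the most CPUs per queue.
    share = [1] * len(groups)
    for _ in range(hw_queues - len(groups)):
        i = max(range(len(groups)), key=lambda g: len(groups[g]) / share[g])
        if share[i] >= len(groups[i]):
            break  # every CPU already has a queue of its own
        share[i] += 1

    mapping = {}
    base = 0
    for cpus, nr_queues in zip(groups, share):
        for pos, cpu in enumerate(cpus):
            # Contiguous blocks of CPUs inside the domain share a queue.
            mapping[cpu] = base + pos * nr_queues // len(cpus)
        base += nr_queues
    return mapping


# Hypothetical 128-CPU machine: 8 LLC domains of 16 CPUs, 31 h/w queues.
cpu_to_llc = {cpu: cpu // 16 for cpu in range(128)}
mapping = map_cpus_by_llc(cpu_to_llc, 31)
print(len(set(mapping.values())), "queues in use")
```

The layout is no longer symmetric (some queues serve more CPUs than others), which is exactly the property the current mapping was designed to keep, so this trades symmetry for cache locality and full use of the device.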
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, 2016-04-08 at 13:40 -0400, Matthew Wilcox wrote:
> On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
> > I'd like to propose a topic on block-mq issues with FC.
> > During my performance testing using block/scsi-mq with FC I've hit
> > several issues I'd like to discuss:
>
> If there's a general block-mq bitching session, I have some ideas :-)

"Block mq bitching session" is going to look a bit bad on the public
schedule, what about "Block MQ implementor feedback"?

> - Inability to use all queues supported by a device. Intel's P3700
>   supports 31 queues, but block-mq insists on assigning an even
>   multiple of CPUs to each queue. So if you have 48 CPUs, it will use
>   24 queues. If you have 128 CPUs, it will only use 16 of the queues.
>
> - Interrupt steering needs to be controlled by block-mq instead of
>   the driver. It's pointless to have each driver implement its own
>   policies on interrupt steering, irqbalanced remains a source of
>   end-user frustration, and block-mq can change the queue<->cpu mapping
>   without the driver's knowledge.
>
> (thanks to Keith for his input on the first and suggestion of the
> second).

OK, what about two sessions, one for general bitching (the feedback
sessions) and one for concrete proposals for improvements ... so rather
than just complaining about the problem, if you have concrete ideas
about fixing it, that would go into the second session.

James
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
> I'd like to propose a topic on block-mq issues with FC.
> During my performance testing using block/scsi-mq with FC I've hit
> several issues I'd like to discuss:

If there's a general block-mq bitching session, I have some ideas :-)

 - Inability to use all queues supported by a device. Intel's P3700
   supports 31 queues, but block-mq insists on assigning an even
   multiple of CPUs to each queue. So if you have 48 CPUs, it will use
   24 queues. If you have 128 CPUs, it will only use 16 of the queues.

 - Interrupt steering needs to be controlled by block-mq instead of
   the driver. It's pointless to have each driver implement its own
   policies on interrupt steering, irqbalanced remains a source of
   end-user frustration, and block-mq can change the queue<->cpu mapping
   without the driver's knowledge.

(thanks to Keith for his input on the first and suggestion of the
second).
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On 04/08/2016 04:29 AM, Hannes Reinecke wrote:
> I'd like to propose a topic on block-mq issues with FC.
> During my performance testing using block/scsi-mq with FC I've hit
> several issues I'd like to discuss:
>
> - timeout handling:
>   Out of necessity the status of any timed out command is undefined.
>   So to be absolutely safe HBAs will be using extended timeouts here
>   (eg 70secs for lpfc). During that time we _could_ signal I/O timeout
>   to the upper layers, but then the tag will be reused, despite the
>   HBA still having a reference to it.
>   I'd like to discuss how this could be solved best with blk-mq.
>
> - Adaption on other HBAs to multiqueue:
>   The current block-mq design assumes symmetric send and receive
>   queues (in effect queue pairs). Any hardware _not_ providing this
>   (like qla2xxx) can not be easily converted to scsi-mq.
>   I'd like to discuss how one could approach converting these drivers.

Hello Hannes,

Without commenting on the specifics of the above proposal, I'm
interested in a further discussion of how to improve multiqueue support
for FC drivers.

Bart.
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, 2016-04-08 at 11:51 -0400, Ewan D. Milne wrote:
> On Fri, 2016-04-08 at 08:11 -0700, James Bottomley wrote:
> > On Fri, 2016-04-08 at 13:29 +0200, Hannes Reinecke wrote:
> > > Hi all,
> > >
> > > I'd like to propose a topic on block-mq issues with FC.
> > > During my performance testing using block/scsi-mq with FC I've
> > > hit several issues I'd like to discuss:
> > >
> > > - timeout handling:
> > >   Out of necessity the status of any timed out command is
> > >   undefined. So to be absolutely safe HBAs will be using extended
> > >   timeouts here (eg 70secs for lpfc). During that time we _could_
> > >   signal I/O timeout to the upper layers, but then the tag will be
> > >   reused, despite the HBA still having a reference to it. I'd like
> > >   to discuss how this could be solved best with blk-mq.
> >
> > What's wrong with the obvious answer: the tag shouldn't be re-used
> > until after at least the TMF abort. If we need to escalate that
> > then it looks like the controller lost the tag and requires a
> > bigger hammer.
> >
> > However, when I look at what we do, it seems the running abort
> > handler is triggered from the block timeout function, so where's
> > the problem? ... surely mq can't free the tag until that returns,
> > because it might extend the time.
> >
> > James
>
> There was some discussion a while back about whether we could
> decouple the SCSI EH's recovery of the device from using the failed
> scmds, so that once the disposition of the original I/O was
> determined (i.e. they had succeeded, failed or timed out & aborted),
> the scmds could be returned to a higher layer while the EH attempted
> to recover the device.

OK, so is the problem the tag or the request pointed to by the scmd? I
think in the tag case, as long as it's not recovered until after the
abort is processed (i.e. until a disposition is returned from
scsi_times_out) then we're fine. If the abort fails, we quiesce the
host anyway, so the block layer can happily queue commands with re-used
tags and the device will never see the duplication. I can't see how
there can be a problem with the requests, because we hold a reference to
them in the scmd, so while it might be nicer to release them earlier, it
shouldn't be a problem today.

James

> That way, in a multipath environment, we could submit the I/O on
> working paths and avoid lengthy delays while we went through all the
> resets.
>
> We still need a successful abort after a timeout, but at least in the
> above scenario we shouldn't be reusing the tags until the device is
> recovered, as further I/O should be blocked while EH is running.
>
> -Ewan
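The rule argued for here, that a tag must not be reused until the abort for its timed-out command has been processed, can be made concrete with a toy tag pool. This is a minimal model for illustration only: `TagPool` and its method names are invented, and it does not mirror the blk-mq tag allocator or the SCSI error handler.

```python
from enum import Enum
from typing import Optional


class TagState(Enum):
    FREE = "free"
    IN_FLIGHT = "in-flight"
    ABORTING = "aborting"  # timed out; the HBA may still reference the tag


class TagPool:
    """Toy tag allocator that withholds timed-out tags until abort is done."""

    def __init__(self, depth: int):
        self.state = [TagState.FREE] * depth

    def alloc(self) -> Optional[int]:
        """Return the lowest free tag, or None if none is reusable yet."""
        for tag, state in enumerate(self.state):
            if state is TagState.FREE:
                self.state[tag] = TagState.IN_FLIGHT
                return tag
        return None

    def complete(self, tag: int) -> None:
        """Normal completion: the tag is immediately reusable."""
        assert self.state[tag] is TagState.IN_FLIGHT
        self.state[tag] = TagState.FREE

    def timed_out(self, tag: int) -> None:
        """Timeout fired: keep the tag reserved while the abort runs."""
        assert self.state[tag] is TagState.IN_FLIGHT
        self.state[tag] = TagState.ABORTING

    def abort_done(self, tag: int) -> None:
        """Abort disposition returned: only now may the tag be reused."""
        assert self.state[tag] is TagState.ABORTING
        self.state[tag] = TagState.FREE


pool = TagPool(depth=2)
first, second = pool.alloc(), pool.alloc()
pool.timed_out(first)
print(pool.alloc())   # no tag is handed out while the abort is pending
pool.abort_done(first)
print(pool.alloc())   # the timed-out tag is reusable again
```

The cost of this rule is visible in the example: while the abort is pending, the effective queue depth shrinks by one, which with long HBA timeouts is the concern raised at the start of the thread.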
Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC
On Fri, 2016-04-08 at 08:11 -0700, James Bottomley wrote:
> On Fri, 2016-04-08 at 13:29 +0200, Hannes Reinecke wrote:
> > Hi all,
> >
> > I'd like to propose a topic on block-mq issues with FC.
> > During my performance testing using block/scsi-mq with FC I've hit
> > several issues I'd like to discuss:
> >
> > - timeout handling:
> >   Out of necessity the status of any timed out command is undefined.
> >   So to be absolutely safe HBAs will be using extended timeouts here
> >   (eg 70secs for lpfc). During that time we _could_ signal I/O
> >   timeout to the upper layers, but then the tag will be reused,
> >   despite the HBA still having a reference to it.
> >   I'd like to discuss how this could be solved best with blk-mq.
>
> What's wrong with the obvious answer: the tag shouldn't be re-used
> until after at least the TMF abort. If we need to escalate that then
> it looks like the controller lost the tag and requires a bigger hammer.
>
> However, when I look at what we do, it seems the running abort handler
> is triggered from the block timeout function, so where's the problem?
> ... surely mq can't free the tag until that returns, because it might
> extend the time.
>
> James

There was some discussion a while back about whether we could decouple
the SCSI EH's recovery of the device from using the failed scmds, so
that once the disposition of the original I/O was determined (i.e. they
had succeeded, failed or timed out & aborted), the scmds could be
returned to a higher layer while the EH attempted to recover the device.

That way, in a multipath environment, we could submit the I/O on working
paths and avoid lengthy delays while we went through all the resets.

We still need a successful abort after a timeout, but at least in the
above scenario we shouldn't be reusing the tags until the device is
recovered, as further I/O should be blocked while EH is running.

-Ewan