Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-12 Thread Jens Axboe

On 04/08/2016 12:06 PM, Keith Busch wrote:

On Fri, Apr 08, 2016 at 01:40:06PM -0400, Matthew Wilcox wrote:

  - Inability to use all queues supported by a device.  Intel's P3700
    supports 31 queues, but block-mq insists on assigning an even multiple
    of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
    If you have 128 CPUs, it will only use 16 of the queues.


While it'd be better to use all the available h/w resources, that's
actually not the worst part.

The real problems occur when there are more physical/unique CPUs than
h/w queues since blk-mq does not consider CPU topology beyond thread
siblings. With 128 CPUs, blk-mq may use all 31 queues P3700 supports,
but many CPU groups won't share a last-level-cache.

Smarter assignment would reclaim some untapped performance, and we can
share such code prior to the session.


There's definitely room for improvement in the cpu mapping code.

However, on the original complaint, it's by design (or, working as 
intended) - this was done to keep the layout symmetrical. It's been 
discussed on the mailing lists before. We can have a discussion whether 
we should change this or not, of course.


--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-12 Thread Quinn Tran

>Hey Willy,
>
>>   - Interrupt steering needs to be controlled by block-mq instead of
>> the driver.  It's pointless to have each driver implement its own
>> policies on interrupt steering, irqbalanced remains a source of
>> end-user frustration, and block-mq can change the queue<->cpu mapping
>> without the driver's knowledge.
>
>I honestly don't think that block-mq is the right place to
>*assign* interrupt steering. Not all HW devices are dedicated
>to storage, take RDMA for example, a RNIC is shared by block
>storage, networking and even user-space workloads so obviously
>block-mq can't understand how a user wants to steer interrupts.
>
>I think that block-mq needs to ask the device driver:
>"what is the optimal queue index for cpu X?" and use it
>while *someone* will be responsible for optimum interrupt
>steering (can be the driver itself or user-space).

+0.5 on block-mq asking the lower layer where to place the queue.  However, I
think it is better for the lower layer to push the data up rather than have
block-mq ask for it.  The user can change, or irqbalance can relocate, the
interrupt vector(s) at runtime.
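
A minimal sketch of the push model described above, assuming a simple callback registration between the two layers. The class and method names (`BlockLayerQueueMap`, `LowerLayer`, `affinity_changed`, `vector_moved`) are invented for illustration and do not correspond to any kernel interface.

```python
class BlockLayerQueueMap:
    """Stand-in for the block layer's cpu -> queue table (hypothetical)."""

    def __init__(self, nr_cpus):
        self.cpu_to_queue = [0] * nr_cpus

    def affinity_changed(self, queue, cpus):
        # Called by the lower layer whenever a queue's vector moves.
        for cpu in cpus:
            self.cpu_to_queue[cpu] = queue


class LowerLayer:
    """Stand-in for a driver that owns the interrupt vectors (hypothetical)."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def vector_moved(self, queue, cpus):
        # Push the new placement up instead of waiting to be asked.
        for callback in self._listeners:
            callback(queue, cpus)


block_map = BlockLayerQueueMap(nr_cpus=4)
driver = LowerLayer()
driver.subscribe(block_map.affinity_changed)
driver.vector_moved(queue=0, cpus=[0, 1])
driver.vector_moved(queue=1, cpus=[2, 3])
driver.vector_moved(queue=1, cpus=[1, 2, 3])  # e.g. the vector was relocated
print(block_map.cpu_to_queue)
```

The point of the sketch is only that a runtime relocation reaches the mapping without the upper layer having to poll for it.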

A QLogic adapter can act in both Initiator and Target modes at the same
time.  Certain target vendors might not want the initiator side to hold this
knob.




>
> From some discussions I had with HCH I think he intends to
>use the cpu reverse-mapping API to try and do what's described
>above (if I'm not mistaken).

>___
>Lsf mailing list
>l...@lists.linux-foundation.org
>https://lists.linuxfoundation.org/mailman/listinfo/lsf


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-10 Thread Sagi Grimberg

Hey Willy,


  - Interrupt steering needs to be controlled by block-mq instead of
    the driver.  It's pointless to have each driver implement its own
    policies on interrupt steering, irqbalanced remains a source of
    end-user frustration, and block-mq can change the queue<->cpu mapping
    without the driver's knowledge.


I honestly don't think that block-mq is the right place to
*assign* interrupt steering. Not all HW devices are dedicated
to storage, take RDMA for example, a RNIC is shared by block
storage, networking and even user-space workloads so obviously
block-mq can't understand how a user wants to steer interrupts.

I think that block-mq needs to ask the device driver:
"what is the optimal queue index for cpu X?" and use it
while *someone* will be responsible for optimum interrupt
steering (can be the driver itself or user-space).

From some discussions I had with HCH I think he intends to
use the cpu reverse-mapping API to try and do what's described
above (if I'm not mistaken).
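
A minimal sketch of the query model suggested above, in which the block layer asks the driver for the best queue for each CPU and leaves interrupt steering to someone else. The names `build_queue_map` and `example_driver_hint`, and the two-CPUs-per-queue layout, are assumptions made for this illustration; they are not the reverse-mapping API mentioned in the message.

```python
from typing import Callable, Dict, Iterable


def build_queue_map(cpus: Iterable[int],
                    optimal_queue_for_cpu: Callable[[int], int]) -> Dict[int, int]:
    """Ask the driver-supplied callback for the best queue index per CPU."""
    return {cpu: optimal_queue_for_cpu(cpu) for cpu in cpus}


def example_driver_hint(cpu: int) -> int:
    # Hypothetical driver policy: queue i's vector is affine to CPUs 2i and 2i+1.
    return cpu // 2


queue_map = build_queue_map(range(8), example_driver_hint)
print(queue_map)
```

The block layer only consumes the answer here; where the interrupt actually lands remains the business of the driver or of user space.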


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Waskiewicz, PJ
On Fri, 2016-04-08 at 13:40 -0400, Matthew Wilcox wrote:
> On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
>  - Interrupt steering needs to be controlled by block-mq instead of
>    the driver.  It's pointless to have each driver implement its own
>    policies on interrupt steering, irqbalanced remains a source of
>    end-user frustration, and block-mq can change the queue<->cpu mapping
>    without the driver's knowledge.

This is the same problem in the networking space as well.  When I added
affinity_hint to the irq_desc, and then that support into irqbalance,
my original approach was to allow the driver to assign affinities.
This was shot down because a driver was influencing policy, versus
allowing userspace to do so.  Meh.

If there's something actionable out of this discussion that makes
interrupt steering better, I'd like to see us drive it into the
networking world as well.  That would also let me rip out the
affinity_hint stuff overall from irqbalance...

-PJ

-- 
PJ Waskiewicz
Principal Engineer, NetApp
e: pj.waskiew...@netapp.com
d: 503.961.3705


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread James Bottomley
On Fri, 2016-04-08 at 20:08 +0200, Christoph Hellwig wrote:
> On Fri, Apr 08, 2016 at 11:00:51AM -0700, James Bottomley wrote:
> > >  - Inability to use all queues supported by a device.  Intel's P3700
> > >    supports 31 queues, but block-mq insists on assigning an even multiple
> > >    of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
> > >    If you have 128 CPUs, it will only use 16 of the queues.
> > > 
> > >  - Interrupt steering needs to be controlled by block-mq instead of
> > >    the driver.  It's pointless to have each driver implement its own
> > >    policies on interrupt steering, irqbalanced remains a source of
> > >    end-user frustration, and block-mq can change the queue<->cpu mapping
> > >    without the driver's knowledge.
> > > 
> > > (thanks to Keith for his input on the first and suggestion of the second).
> > 
> > OK, what about two sessions, one for general bitching (the feedback
> > sessions) and one for concrete proposals for improvements ... so rather
> > than just complaining about the problem, if you have concrete ideas
> > about fixing it, that would go into the second session.
> 
> We already have the blk-mq interrupt assignment session on the schedule,
> which is about willy's item.  And my work in progress code to address
> the issue also mostly addresses his item number 1, so I think we can
> just keep the schedule mostly as is and just rename "multiqueue interrupt
> assignment" into "multiqueue interrupt and queue assignment".
> 
> No need to blow it up into three slots.

Agreed; I made the adjustments.

James




Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Bart Van Assche

On 04/08/2016 10:40 AM, Matthew Wilcox wrote:

  - Interrupt steering needs to be controlled by block-mq instead of
    the driver.  It's pointless to have each driver implement its own
    policies on interrupt steering, irqbalanced remains a source of
    end-user frustration, and block-mq can change the queue<->cpu mapping
    without the driver's knowledge.


I'm looking forward to the day that I will be able to drop my script for 
spreading interrupts manually (see also the fifth attachment of 
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409).
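
For readers unfamiliar with manual interrupt spreading, here is a rough sketch of the idea. It is not the script from the attachment referenced above. It assumes the usual Linux convention of writing a hex CPU mask (comma-separated 32-bit groups) to `/proc/irq/<irq>/smp_affinity`; the IRQ numbers and the helper names `cpu_mask` and `spread_irqs` are made up for the example, and the code only prints what it would write.

```python
def cpu_mask(cpu):
    """Hex mask for one CPU, as comma-separated 32-bit groups."""
    value = 1 << cpu
    groups = []
    while True:
        groups.append(f"{value & 0xFFFFFFFF:08x}")
        value >>= 32
        if value == 0:
            break
    return ",".join(reversed(groups))


def spread_irqs(irqs, cpus):
    """Assign IRQs to CPUs round-robin; returns {irq: mask string}."""
    return {irq: cpu_mask(cpus[i % len(cpus)]) for i, irq in enumerate(irqs)}


# Hypothetical IRQ numbers; applying the result would require root.
for irq, mask in spread_irqs([40, 41, 42], [0, 1]).items():
    print(f"/proc/irq/{irq}/smp_affinity <- {mask}")
```

A daemon such as irqbalance may later rewrite these settings, which is part of the frustration described in the quoted item.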


Bart.



Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Christoph Hellwig
On Fri, Apr 08, 2016 at 11:00:51AM -0700, James Bottomley wrote:
> >  - Inability to use all queues supported by a device.  Intel's P3700
> >    supports 31 queues, but block-mq insists on assigning an even multiple
> >    of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
> >    If you have 128 CPUs, it will only use 16 of the queues.
> > 
> >  - Interrupt steering needs to be controlled by block-mq instead of
> >    the driver.  It's pointless to have each driver implement its own
> >    policies on interrupt steering, irqbalanced remains a source of
> >    end-user frustration, and block-mq can change the queue<->cpu mapping
> >    without the driver's knowledge.
> > 
> > (thanks to Keith for his input on the first and suggestion of the second).
> 
> OK, what about two sessions, one for general bitching (the feedback
> sessions) and one for concrete proposals for improvements ... so rather
> than just complaining about the problem, if you have concrete ideas
> about fixing it, that would go into the second session.

We already have the blk-mq interrupt assignment session on the schedule,
which is about willy's item.  And my work in progress code to address
the issue also mostly addresses his item number 1, so I think we can
just keep the schedule mostly as is and just rename "multiqueue interrupt
assignment" into "multiqueue interrupt and queue assignment".

No need to blow it up into three slots.


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Keith Busch
On Fri, Apr 08, 2016 at 01:40:06PM -0400, Matthew Wilcox wrote:
>  - Inability to use all queues supported by a device.  Intel's P3700
>    supports 31 queues, but block-mq insists on assigning an even multiple
>    of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
>    If you have 128 CPUs, it will only use 16 of the queues.

While it'd be better to use all the available h/w resources, that's
actually not the worst part.

The real problems occur when there are more physical/unique CPUs than
h/w queues since blk-mq does not consider CPU topology beyond thread
siblings. With 128 CPUs, blk-mq may use all 31 queues P3700 supports,
but many CPU groups won't share a last-level-cache.

Smarter assignment would reclaim some untapped performance, and we can
share such code prior to the session.
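
To make the cache-sharing concern concrete, here is a small sketch. It assumes a proportional mapping of the form `cpu * nr_queues // nr_cpus` and a hypothetical machine with 64 physical cores split into four last-level-cache domains of 16 cores each. Neither the formula nor the topology is taken from this thread, and the real blk-mq mapping code may differ.

```python
def proportional_map(nr_cpus, nr_queues):
    """Topology-unaware mapping: spread CPUs over queues in index order."""
    return [cpu * nr_queues // nr_cpus for cpu in range(nr_cpus)]


def queues_spanning_llc(cpu_to_queue, cores_per_llc):
    """Queues whose CPUs come from more than one last-level-cache domain."""
    llcs_per_queue = {}
    for cpu, queue in enumerate(cpu_to_queue):
        llcs_per_queue.setdefault(queue, set()).add(cpu // cores_per_llc)
    return sorted(q for q, llcs in llcs_per_queue.items() if len(llcs) > 1)


mapping = proportional_map(64, 31)
print("queues used:", len(set(mapping)))
print("queues shared across LLC domains:", queues_spanning_llc(mapping, 16))
```

Under these assumptions every hardware queue is used, yet some queues end up serving CPUs that do not share a cache, which is the kind of loss a topology-aware assignment would try to avoid.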


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread James Bottomley
On Fri, 2016-04-08 at 13:40 -0400, Matthew Wilcox wrote:
> On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
> > I'd like to propose a topic on block-mq issues with FC.
> > During my performance testing using block/scsi-mq with FC I've hit
> > several issues I'd like to discuss:
> 
> If there's a general block-mq bitching session, I have some ideas :-)

"Block mq bitching session" is going to look a bit bad on the public
schedule, what about "Block MQ implementor feedback"?

>  - Inability to use all queues supported by a device.  Intel's P3700
>    supports 31 queues, but block-mq insists on assigning an even multiple
>    of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
>    If you have 128 CPUs, it will only use 16 of the queues.
> 
>  - Interrupt steering needs to be controlled by block-mq instead of
>    the driver.  It's pointless to have each driver implement its own
>    policies on interrupt steering, irqbalanced remains a source of
>    end-user frustration, and block-mq can change the queue<->cpu mapping
>    without the driver's knowledge.
> 
> (thanks to Keith for his input on the first and suggestion of the second).

OK, what about two sessions, one for general bitching (the feedback
sessions) and one for concrete proposals for improvements ... so rather
than just complaining about the problem, if you have concrete ideas
about fixing it, that would go into the second session.

James




Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Matthew Wilcox
On Fri, Apr 08, 2016 at 01:29:26PM +0200, Hannes Reinecke wrote:
> I'd like to propose a topic on block-mq issues with FC.
> During my performance testing using block/scsi-mq with FC I've hit
> several issues I'd like to discuss:

If there's a general block-mq bitching session, I have some ideas :-)

 - Inability to use all queues supported by a device.  Intel's P3700
   supports 31 queues, but block-mq insists on assigning an even multiple
   of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
   If you have 128 CPUs, it will only use 16 of the queues.

 - Interrupt steering needs to be controlled by block-mq instead of
   the driver.  It's pointless to have each driver implement its own
   policies on interrupt steering, irqbalanced remains a source of
   end-user frustration, and block-mq can change the queue<->cpu mapping
   without the driver's knowledge.

(thanks to Keith for his input on the first and suggestion of the second).
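
The queue counts quoted in the first item can be reproduced with a short sketch. This is only a model of the behaviour as described in this message (every queue serves the same number of CPUs), not the kernel's mapping code, and the function name `symmetric_queue_count` is made up for the illustration.

```python
def symmetric_queue_count(nr_cpus, nr_hw_queues):
    """Largest queue count <= nr_hw_queues that divides nr_cpus evenly."""
    for queues in range(min(nr_cpus, nr_hw_queues), 0, -1):
        if nr_cpus % queues == 0:
            return queues
    return 1


for cpus in (48, 128):
    print(cpus, "CPUs ->", symmetric_queue_count(cpus, 31), "of 31 queues")
```

With 31 hardware queues, the largest even split of 48 CPUs is 24 queues of 2 CPUs, and of 128 CPUs is 16 queues of 8 CPUs, matching the numbers above.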


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Bart Van Assche

On 04/08/2016 04:29 AM, Hannes Reinecke wrote:

I'd like to propose a topic on block-mq issues with FC.
During my performance testing using block/scsi-mq with FC I've hit
several issues I'd like to discuss:

- timeout handling:
Out of necessity the status of any timed out command is undefined.
So to be absolutely safe HBAs will be using extended timeouts here
(eg 70secs for lpfc). During that time we _could_ signal I/O timeout
to the upper layers, but then the tag will be reused, despite the
HBA still having a reference to it.
I'd like to discuss how this could be solved best with blk-mq.

- Adapting other HBAs to multiqueue:
The current block-mq design assumes symmetric send and receive
queues (in effect queue pairs). Any hardware _not_ providing this
(like qla2xxx) can not be easily converted to scsi-mq. I'd like to
discuss how one could approach converting these drivers.


Hello Hannes,

Without commenting on the specifics of the above proposal, I'm 
interested in a further discussion of how to improve multiqueue support 
for FC drivers.


Bart.



Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread James Bottomley
On Fri, 2016-04-08 at 11:51 -0400, Ewan D. Milne wrote:
> On Fri, 2016-04-08 at 08:11 -0700, James Bottomley wrote:
> > On Fri, 2016-04-08 at 13:29 +0200, Hannes Reinecke wrote:
> > > Hi all,
> > > 
> > > I'd like to propose a topic on block-mq issues with FC.
> > > During my performance testing using block/scsi-mq with FC I've 
> > > hit several issues I'd like to discuss:
> > > 
> > > - timeout handling:
> > > Out of necessity the status of any timed out command is 
> > > undefined. So to be absolutely safe HBAs will be using extended 
> > > timeouts here (eg 70secs for lpfc). During that time we _could_ 
> > > signal I/O timeout to the upper layers, but then the tag will be 
> > > reused, despite the HBA still having a reference to it. I'd like
> > > to discuss how this could be solved best with blk-mq.
> > 
> > What's wrong with the obvious answer: the tag shouldn't be re-used
> > until after at least the TMF abort.  If we need to escalate that 
> > then it looks like the controller lost the tag and requires a 
> > bigger hammer.
> > 
> > However, when I look at what we do, it seems the running abort 
> > handler is triggered from the block timeout function, so where's 
> > the problem? ... surely mq can't free the tag until that returns, 
> > > because it might extend the time.
> > 
> > James
> 
> There was some discussion a while back about whether we could 
> decouple the SCSI EH's recovery of the device from using the failed 
> scmds, so that once the disposition of the original I/O was 
> determined (i.e. they had succeeded, failed or timed out & aborted), 
> the scmds could be returned to a higher layer while the EH attempted 
> to recover the device.

OK, so is the problem the tag or the request pointed to by the scmd?  I
think in the tag case, as long as it's not recovered until after the
abort is processed (i.e. until a disposition is returned from
scsi_times_out) then we're fine.  If the abort fails, we quiesce the
host anyway, so the block layer can happily queue commands with re-used
tags and the device will never see the duplication.

I can't see how there can be a problem with the requests, because we
hold a reference to them in the scmd, so while it might be nicer to
release them earlier, it shouldn't be a problem today.
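
The tag rule argued for above can be sketched as a small state machine: a timed-out tag is neither in flight nor free until the abort has been processed. The `TagSet` class and its method names are hypothetical and only model the rule; they are not blk-mq or SCSI midlayer code.

```python
class TagSet:
    """Toy tag allocator that withholds timed-out tags until abort completes."""

    def __init__(self, depth):
        self.free = list(range(depth))
        self.in_flight = set()
        self.aborting = set()

    def alloc(self):
        if not self.free:
            return None  # no tag available, caller must wait
        tag = self.free.pop(0)
        self.in_flight.add(tag)
        return tag

    def complete(self, tag):
        # Normal completion: the HBA no longer references the tag.
        self.in_flight.remove(tag)
        self.free.append(tag)

    def timed_out(self, tag):
        # The HBA may still hold a reference, so do not free the tag yet.
        self.in_flight.remove(tag)
        self.aborting.add(tag)

    def abort_done(self, tag):
        # Only after the abort is processed may the tag be reused.
        self.aborting.remove(tag)
        self.free.append(tag)


tags = TagSet(depth=2)
a, b = tags.alloc(), tags.alloc()
tags.timed_out(a)
tags.complete(b)
print(tags.alloc())   # reuses b's tag, never a's
print(tags.alloc())   # nothing free while a's abort is pending
tags.abort_done(a)
print(tags.alloc())   # a's tag is available again
```

The cost of this rule is reduced queue depth while an abort is outstanding, which is the trade-off behind the extended HBA timeouts raised in the original proposal.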

James


>   That way, in a multipath environment, we could submit the I/O on
> working paths and avoid lengthy delays while we went through all the
> resets.
> 
> We still need a successful abort after a timeout, but at least in the
> above scenario we shouldn't be reusing the tags until the device is
> recovered, as further I/O should be blocked while EH is running.
> 
> -Ewan



Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Ewan D. Milne
On Fri, 2016-04-08 at 08:11 -0700, James Bottomley wrote:
> On Fri, 2016-04-08 at 13:29 +0200, Hannes Reinecke wrote:
> > Hi all,
> > 
> > I'd like to propose a topic on block-mq issues with FC.
> > During my performance testing using block/scsi-mq with FC I've hit
> > several issues I'd like to discuss:
> > 
> > - timeout handling:
> > Out of necessity the status of any timed out command is undefined.
> > So to be absolutely safe HBAs will be using extended timeouts here
> > (eg 70secs for lpfc). During that time we _could_ signal I/O timeout
> > to the upper layers, but then the tag will be reused, despite the
> > HBA still having a reference to it.
> > I'd like to discuss how this could be solved best with blk-mq.
> 
> What's wrong with the obvious answer: the tag shouldn't be re-used
> until after at least the TMF abort.  If we need to escalate that then
> it looks like the controller lost the tag and requires a bigger hammer.
> 
> However, when I look at what we do, it seems the running abort handler
> is triggered from the block timeout function, so where's the problem?
> ... surely mq can't free the tag until that returns, because it might
> extend the time. 
> 
> James

There was some discussion a while back about whether we could decouple
the SCSI EH's recovery of the device from using the failed scmds, so
that once the disposition of the original I/O was determined (i.e. they
had succeeded, failed or timed out & aborted), the scmds could be
returned to a higher layer while the EH attempted to recover the
device.  That way, in a multipath environment, we could submit the I/O
on working paths and avoid lengthy delays while we went through all the
resets.

We still need a successful abort after a timeout, but at least in the
above scenario we shouldn't be reusing the tags until the device is
recovered, as further I/O should be blocked while EH is running.

-Ewan

