Re: bfq-mq performance comparison to cfq
On Wed, 2017-04-26 at 10:18 +0200, Paolo Valente wrote:
> I guess that both the above issues may not be dramatic. In contrast, the following last issue seems harder to address: BFQ uses two different privileging schemes, one suitable for interactive applications, and one suitable for soft real-time applications. So, what scheme should BFQ enable for processes in the RT I/O class?
>
> Because of these concerns, also for I/O I would find much clearer and more flexible an ad-hoc, complete and explicit solution like the one(s) Juri reports (I've already nagged some of the recipients here to get support and collaboration on this sort of extension of the basic benefits of a good I/O scheduler).

The numerical values of I/O priorities are part of the API between the kernel and user space, and hence the numerical value associated with a class must not change. But we would associate different priority values with interactive and soft real-time applications, e.g. IOPRIO_CLASS_RT(0) for soft real-time applications and IOPRIO_CLASS_RT(7) for interactive applications. See also http://man7.org/linux/man-pages/man2/ioprio_set.2.html.

In my opinion the above proposal does not contradict what has been proposed for informed run-times. We could e.g. add support to the block I/O controller cgroup for configuring the I/O priority. No matter how informed run-times communicate application constraints to the kernel, the configured I/O scheduler and the block layer will have to realize these constraints. If anyone thinks that there is a mechanism better suited than I/O priorities for communicating these constraints to the kernel, I'm interested to hear about that alternative.

Bart.
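To make the numeric encoding above concrete, here is a minimal sketch of how an application (or the software launching it) could apply the proposed convention through the existing ioprio_set(2) interface. The 0-versus-7 split within the RT class is only the convention proposed in this message, not current kernel behavior; glibc provides no wrapper for the system call, so it is invoked via syscall(2), and selecting the RT class normally requires CAP_SYS_ADMIN.

/*
 * Sketch only: tag the calling process according to the proposed convention,
 * IOPRIO_CLASS_RT level 0 for soft real-time, level 7 for interactive.
 * The macros mirror the kernel's include/linux/ioprio.h, as documented in
 * ioprio_set(2).
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

static int set_self_ioprio(int ioclass, int level)
{
	/* who == 0 means the calling process. */
	return syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		       IOPRIO_PRIO_VALUE(ioclass, level));
}

int main(void)
{
	/* A soft real-time application would use level 0 ... */
	if (set_self_ioprio(IOPRIO_CLASS_RT, 0) < 0)
		perror("ioprio_set");
	/* ... while an interactive application would use level 7 instead. */
	return 0;
}

A window manager or shell could issue the same ioprio_set() call with the child's pid (and IOPRIO_WHO_PROCESS) right after starting an application, instead of relying on the application doing it itself.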
Re: bfq-mq performance comparison to cfq
Hi,

sorry if I jump into this interesting conversation, but I felt some people might have missed this and might be interested as well (even if from a slightly different POV). Let me Cc them (Patrick, Morten, Peter, Joel, Andres).

On 19/04/17 09:02, Paolo Valente wrote:
>
>> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche ha scritto:
>>
>> On 04/11/17 00:29, Paolo Valente wrote:
>>>
>>>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche ha scritto:
>>>>
>>>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>>>>> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you have hit some bug, and I'll have a little more work to do ...
>>>>
>>>> Has it been considered to make applications tell the I/O scheduler whether to optimize for latency or for throughput? It shouldn't be that hard for window managers and shells to figure out whether or not a new application that is being started is interactive. This would require a mechanism that allows applications to provide such information to the I/O scheduler. Wouldn't that be a better approach than the I/O scheduler trying to guess whether or not an application is interactive?
>>>
>>> IMO that would be an (or maybe the) optimal solution, in terms of both throughput and latency. We have even developed a prototype doing what you propose, for Android. Unfortunately, I have not yet succeeded in getting support to turn it into candidate production code, or to make a similar solution for LSB-compliant systems.
>>
>> Hello Paolo,
>>
>> What API was used by the Android application to tell the I/O scheduler to optimize for latency? Do you think that it would be sufficient if the application uses the ioprio_set() system call to set the I/O priority to IOPRIO_CLASS_RT?
>>
>
> That's exactly the hack we are using in our prototype. However, it can only be a temporary hack, because it mixes two slightly different concepts: 1) the activation of weight raising and other mechanisms for reducing latency for the target app; 2) the assignment of a different priority class, which (cleanly) means just that processes in a lower priority class will be served only when the processes of the target app have no pending I/O request. Finding a clean boosting API would be one of the main steps to turn our prototype into a usable solution.
>

I also need to append here Bart's latest reply (which doesn't have all the context):

On 19/04/17 15:43, Bart Van Assche wrote:
> On Wed, 2017-04-19 at 09:02 +0200, Paolo Valente wrote:
>>> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche ha scritto:
>>>
>>> What API was used by the Android application to tell the I/O scheduler to optimize for latency? Do you think that it would be sufficient if the application uses the ioprio_set() system call to set the I/O priority to IOPRIO_CLASS_RT?
>>
>> That's exactly the hack we are using in our prototype. However, it can only be a temporary hack, because it mixes two slightly different concepts: 1) the activation of weight raising and other mechanisms for reducing latency for the target app; 2) the assignment of a different priority class, which (cleanly) means just that processes in a lower priority class will be served only when the processes of the target app have no pending I/O request. Finding a clean boosting API would be one of the main steps to turn our prototype into a usable solution.
>
> Hello Paolo,
>
> Sorry, but I do not agree with calling this use of I/O priorities a hack. I also do not agree that I/O requests submitted by processes in a lower priority class will only be served by the I/O scheduler when there are no pending requests in a higher class. It wouldn't be that hard to modify I/O schedulers that support I/O priorities to avoid the starvation you referred to. What I expect will happen is that sooner or later a Linux distributor will start receiving bug reports about the heuristics for detecting interactive and streaming applications, and that the person who works on such a bug report will realize that it is easier to remove those heuristics from BFQ and to modify streaming applications and the software that starts interactive applications (e.g. a window manager) to use a higher I/O priority.
>
> Please also note that what I described above may require introducing additional I/O priorities in the Linux kernel next to the existing I/O priorities RT, BE and NONE, and that this may require mapping multiple of these priorities onto the same drive priority.
Re: bfq-mq performance comparison to cfq
On Wed, 2017-04-19 at 09:02 +0200, Paolo Valente wrote:
>> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche ha scritto:
>>
>> What API was used by the Android application to tell the I/O scheduler to optimize for latency? Do you think that it would be sufficient if the application uses the ioprio_set() system call to set the I/O priority to IOPRIO_CLASS_RT?
>
> That's exactly the hack we are using in our prototype. However, it can only be a temporary hack, because it mixes two slightly different concepts: 1) the activation of weight raising and other mechanisms for reducing latency for the target app; 2) the assignment of a different priority class, which (cleanly) means just that processes in a lower priority class will be served only when the processes of the target app have no pending I/O request. Finding a clean boosting API would be one of the main steps to turn our prototype into a usable solution.

Hello Paolo,

Sorry, but I do not agree with calling this use of I/O priorities a hack. I also do not agree that I/O requests submitted by processes in a lower priority class will only be served by the I/O scheduler when there are no pending requests in a higher class. It wouldn't be that hard to modify I/O schedulers that support I/O priorities to avoid the starvation you referred to. What I expect will happen is that sooner or later a Linux distributor will start receiving bug reports about the heuristics for detecting interactive and streaming applications, and that the person who works on such a bug report will realize that it is easier to remove those heuristics from BFQ and to modify streaming applications and the software that starts interactive applications (e.g. a window manager) to use a higher I/O priority.

Please also note that what I described above may require introducing additional I/O priorities in the Linux kernel next to the existing I/O priorities RT, BE and NONE, and that this may require mapping multiple of these priorities onto the same drive priority.

Bart.
Re: bfq-mq performance comparison to cfq
> Il giorno 19 apr 2017, alle ore 07:01, Bart Van Assche ha scritto:
>
> On 04/11/17 00:29, Paolo Valente wrote:
>>
>>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche ha scritto:
>>>
>>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>>>> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you have hit some bug, and I'll have a little more work to do ...
>>>
>>> Has it been considered to make applications tell the I/O scheduler whether to optimize for latency or for throughput? It shouldn't be that hard for window managers and shells to figure out whether or not a new application that is being started is interactive. This would require a mechanism that allows applications to provide such information to the I/O scheduler. Wouldn't that be a better approach than the I/O scheduler trying to guess whether or not an application is interactive?
>>
>> IMO that would be an (or maybe the) optimal solution, in terms of both throughput and latency. We have even developed a prototype doing what you propose, for Android. Unfortunately, I have not yet succeeded in getting support to turn it into candidate production code, or to make a similar solution for LSB-compliant systems.
>
> Hello Paolo,
>
> What API was used by the Android application to tell the I/O scheduler to optimize for latency? Do you think that it would be sufficient if the application uses the ioprio_set() system call to set the I/O priority to IOPRIO_CLASS_RT?
>

That's exactly the hack we are using in our prototype. However, it can only be a temporary hack, because it mixes two slightly different concepts: 1) the activation of weight raising and other mechanisms for reducing latency for the target app; 2) the assignment of a different priority class, which (cleanly) means just that processes in a lower priority class will be served only when the processes of the target app have no pending I/O request. Finding a clean boosting API would be one of the main steps to turn our prototype into a usable solution.

Thanks,
Paolo

> Thanks,
>
> Bart.
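For concreteness, the kind of temporary boost described above could look like the sketch below. This is not the prototype's code: the launcher-side helper, the arbitrary RT level and the crude time-based "startup is over" condition are all made up for the example, and only the documented ioprio_get(2)/ioprio_set(2) system calls are used.

/*
 * Hypothetical launcher-side helper: boost a freshly started application to
 * IOPRIO_CLASS_RT for its startup I/O, then restore its previous priority so
 * it does not keep real-time service forever.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1 };

int main(int argc, char **argv)
{
	pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
	int old;

	/* Remember the target's current I/O priority. */
	old = syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, pid);
	if (old < 0) {
		perror("ioprio_get");
		return 1;
	}

	/* Boost to the RT class (needs CAP_SYS_ADMIN); level 4 is arbitrary. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4)) < 0)
		perror("ioprio_set (boost)");

	sleep(10);	/* crude stand-in for "startup I/O is done" */

	/* Drop back to whatever the process had before. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid, old) < 0)
		perror("ioprio_set (restore)");

	return 0;
}

The restore step is exactly what the distinction above is about: for as long as the RT class is set, the process is not merely weight-raised, it sits in a different scheduling class altogether.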
Re: bfq-mq performance comparison to cfq
On 04/11/17 00:29, Paolo Valente wrote:
>
>> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche ha scritto:
>>
>> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>>> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you have hit some bug, and I'll have a little more work to do ...
>>
>> Has it been considered to make applications tell the I/O scheduler whether to optimize for latency or for throughput? It shouldn't be that hard for window managers and shells to figure out whether or not a new application that is being started is interactive. This would require a mechanism that allows applications to provide such information to the I/O scheduler. Wouldn't that be a better approach than the I/O scheduler trying to guess whether or not an application is interactive?
>
> IMO that would be an (or maybe the) optimal solution, in terms of both throughput and latency. We have even developed a prototype doing what you propose, for Android. Unfortunately, I have not yet succeeded in getting support to turn it into candidate production code, or to make a similar solution for LSB-compliant systems.

Hello Paolo,

What API was used by the Android application to tell the I/O scheduler to optimize for latency? Do you think that it would be sufficient if the application uses the ioprio_set() system call to set the I/O priority to IOPRIO_CLASS_RT?

Thanks,

Bart.
Re: bfq-mq performance comparison to cfq
On Mon, Apr 10, 2017 at 11:55:43AM +0200, Paolo Valente wrote:
>
>> Il giorno 10 apr 2017, alle ore 11:05, Andreas Herrmann ha scritto:
>>
>> Hi Paolo,
>>
>> I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818 and did some fio tests to compare the behavior to CFQ.
>>
>> My understanding is that bfq-mq is supposed to be merged sooner or later and then it will be the only reasonable I/O scheduler with blk-mq for rotational devices. Hence I think it is interesting to see what to expect performance-wise in comparison to CFQ, which is usually used for such devices with the legacy block layer.
>>
>> I've just done simple tests iterating over the number of jobs (1-8, as the test system had 8 CPUs) for all (random/sequential) read/write patterns. The fixed set of fio parameters used was '--size=5G --group_reporting --ioengine=libaio --direct=1 --iodepth=1 --runtime=10'.
>>
>> I've done 10 runs for each such configuration. The device used was an older SAMSUNG HD103SJ 1TB disk, SATA attached. The results that stick out the most are those for sequential reads and sequential writes:
>>
>> * sequential reads
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>>
>> jobs    [0] mean    [0] stddev    [1] mean    [1] stddev
>> 1      17060.300       77.090    17657.500       69.602
>> 2      15318.200       28.817    10678.000      279.070
>> 3      15403.200       42.762     9874.600       93.436
>> 4      14521.200      624.111     9918.700      226.425
>> 5      13893.900      144.354     9485.000      109.291
>> 6      13065.300      180.608     9419.800       75.043
>> 7      12169.600       95.422     9863.800      227.662
>> 8      12422.200      215.535    15335.300      245.764

For the sake of completeness, here are the corresponding results when setting low_latency=0, for sequential reads:

[1] - bfq_mq, intel_pstate driver, powersave governor, low_latency=1 (default)
[2] - bfq_mq, intel_pstate driver, powersave governor, low_latency=0

jobs    [2] mean    [2] stddev    [1] mean    [1] stddev
1      17959.500       62.376    17657.500       69.602
2      16137.200      696.527    10678.000      279.070
3      16223.600       41.291     9874.600       93.436
4      16012.200       88.924     9918.700      226.425
5      15937.900       51.172     9485.000      109.291
6      15849.300       54.021     9419.800       75.043
7      15794.300       98.857     9863.800      227.662
8      15494.800      895.513    15335.300      245.764

>> * sequential writes
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>>
>> jobs    [0] mean    [0] stddev    [1] mean    [1] stddev
>> 1      14171.300       80.796    14392.500      182.587
>> 2      13520.000       88.967     9565.400      119.400
>> 3      13396.100       44.936     9284.000       25.122
>> 4      13139.800       62.325     8846.600       45.926
>> 5      12942.400       45.729     8568.700       35.852
>> 6      12650.600       41.283     8275.500      199.273
>> 7      12475.900       43.565     8252.200       33.145
>> 8      12307.200       43.594    13617.500      127.773

... and for sequential writes:

[1] - bfq_mq, intel_pstate driver, powersave governor, low_latency=1 (default)
[2] - bfq_mq, intel_pstate driver, powersave governor, low_latency=0

jobs    [2] mean    [2] stddev    [1] mean    [1] stddev
1          1.800      248.806    14392.500      182.587
2      13929.300       89.137     9565.400      119.400
3      13875.400       83.084     9284.000       25.122
4      13845.000      106.445     8846.600       45.926
5      13784.800       66.304     8568.700       35.852
6      13774.900       51.845     8275.500      199.273
7      13741.900       92.647     8252.200       33.145
8      13732.400       88.575    13617.500      127.773

>> With performance instead of powersave governor the results were (expectedly) higher, but the pattern was the same -- bfq-mq shows a "dent" for tests with 2-7 fio jobs. At the moment I have no explanation for this behavior.
>
> I have :)
>
> BFQ, by default, is configured to privilege latency over throughput. In this respect, as various people and I happened to discuss a few times, even on these mailing lists, the only way to provide strong low-latency guarantees, at the moment, is through device idling. The throughput loss you see is very likely to be the consequence of that idling.
>
> Why does the throughput go back up at eight jobs? Because, if many processes are born in a very short time interval, then BFQ understands that some multi-job task is being started. And these parallel tasks usually prefer overall high throughput to single-process low latency. Then, BFQ does not idle the device for these processes.

Thanks for the explanation!

> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0.

That helped a lot. (See above.)
Re: bfq-mq performance comparison to cfq
> Il giorno 10 apr 2017, alle ore 17:15, Bart Van Assche ha scritto:
>
> On Mon, 2017-04-10 at 11:55 +0200, Paolo Valente wrote:
>> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you have hit some bug, and I'll have a little more work to do ...
>
> Hello Paolo,
>
> Has it been considered to make applications tell the I/O scheduler whether to optimize for latency or for throughput? It shouldn't be that hard for window managers and shells to figure out whether or not a new application that is being started is interactive. This would require a mechanism that allows applications to provide such information to the I/O scheduler. Wouldn't that be a better approach than the I/O scheduler trying to guess whether or not an application is interactive?
>

IMO that would be an (or maybe the) optimal solution, in terms of both throughput and latency. We have even developed a prototype doing what you propose, for Android. Unfortunately, I have not yet succeeded in getting support to turn it into candidate production code, or to make a similar solution for LSB-compliant systems.

Thanks,
Paolo

> Bart.
Re: bfq-mq performance comparison to cfq
> Il giorno 10 apr 2017, alle ore 11:55, Paolo Valente ha scritto:
>
>> Il giorno 10 apr 2017, alle ore 11:05, Andreas Herrmann ha scritto:
>>
>> Hi Paolo,
>>
>> I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818 and did some fio tests to compare the behavior to CFQ.
>>
>> My understanding is that bfq-mq is supposed to be merged sooner or later and then it will be the only reasonable I/O scheduler with blk-mq for rotational devices. Hence I think it is interesting to see what to expect performance-wise in comparison to CFQ, which is usually used for such devices with the legacy block layer.
>>
>> I've just done simple tests iterating over the number of jobs (1-8, as the test system had 8 CPUs) for all (random/sequential) read/write patterns. The fixed set of fio parameters used was '--size=5G --group_reporting --ioengine=libaio --direct=1 --iodepth=1 --runtime=10'.
>>
>> I've done 10 runs for each such configuration. The device used was an older SAMSUNG HD103SJ 1TB disk, SATA attached. The results that stick out the most are those for sequential reads and sequential writes:
>>
>> * sequential reads
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>>
>> jobs    [0] mean    [0] stddev    [1] mean    [1] stddev
>> 1      17060.300       77.090    17657.500       69.602
>> 2      15318.200       28.817    10678.000      279.070
>> 3      15403.200       42.762     9874.600       93.436
>> 4      14521.200      624.111     9918.700      226.425
>> 5      13893.900      144.354     9485.000      109.291
>> 6      13065.300      180.608     9419.800       75.043
>> 7      12169.600       95.422     9863.800      227.662
>> 8      12422.200      215.535    15335.300      245.764
>>
>> * sequential writes
>> [0] - cfq, intel_pstate driver, powersave governor
>> [1] - bfq_mq, intel_pstate driver, powersave governor
>>
>> jobs    [0] mean    [0] stddev    [1] mean    [1] stddev
>> 1      14171.300       80.796    14392.500      182.587
>> 2      13520.000       88.967     9565.400      119.400
>> 3      13396.100       44.936     9284.000       25.122
>> 4      13139.800       62.325     8846.600       45.926
>> 5      12942.400       45.729     8568.700       35.852
>> 6      12650.600       41.283     8275.500      199.273
>> 7      12475.900       43.565     8252.200       33.145
>> 8      12307.200       43.594    13617.500      127.773
>>
>> With performance instead of powersave governor the results were (expectedly) higher, but the pattern was the same -- bfq-mq shows a "dent" for tests with 2-7 fio jobs. At the moment I have no explanation for this behavior.
>>
>
> I have :)
>
> BFQ, by default, is configured to privilege latency over throughput. In this respect, as various people and I happened to discuss a few times, even on these mailing lists, the only way to provide strong low-latency guarantees, at the moment, is through device idling. The throughput loss you see is very likely to be the consequence of that idling.
>
> Why does the throughput go back up at eight jobs? Because, if many processes are born in a very short time interval, then BFQ understands that some multi-job task is being started. And these parallel tasks usually prefer overall high throughput to single-process low latency. Then, BFQ does not idle the device for these processes.
>
> That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you have hit some bug, and I'll have a little more work to do ...
>

I forgot two pieces of information:

1) The throughput drop lasts only for a few seconds, after which BFQ stops caring about the latency of the newborn fio processes, and aims only at throughput.

2) One of my main goals, if and after BFQ is merged, is to get about the same low-latency guarantees without idling, and thus without losing throughput.

Paolo

> Thanks,
> Paolo
>
>> Regards,
>> Andreas
Re: bfq-mq performance comparison to cfq
> Il giorno 10 apr 2017, alle ore 11:05, Andreas Herrmann ha scritto:
>
> Hi Paolo,
>
> I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818 and did some fio tests to compare the behavior to CFQ.
>
> My understanding is that bfq-mq is supposed to be merged sooner or later, and then it will be the only reasonable I/O scheduler with blk-mq for rotational devices. Hence I think it is interesting to see what to expect performance-wise in comparison to CFQ, which is usually used for such devices with the legacy block layer.
>
> I've just done simple tests iterating over the number of jobs (1-8, as the test system had 8 CPUs) for all (random/sequential) read/write patterns. The fixed set of fio parameters used was '--size=5G --group_reporting --ioengine=libaio --direct=1 --iodepth=1 --runtime=10'.
>
> I've done 10 runs for each such configuration. The device used was an older SAMSUNG HD103SJ 1TB disk, SATA attached. The results that stick out the most are those for sequential reads and sequential writes:
>
> * sequential reads
> [0] - cfq, intel_pstate driver, powersave governor
> [1] - bfq_mq, intel_pstate driver, powersave governor
>
> jobs    [0] mean     [0] stddev    [1] mean     [1] stddev
>   1  &  17060.300  &    77.090  &  17657.500  &    69.602
>   2  &  15318.200  &    28.817  &  10678.000  &   279.070
>   3  &  15403.200  &    42.762  &   9874.600  &    93.436
>   4  &  14521.200  &   624.111  &   9918.700  &   226.425
>   5  &  13893.900  &   144.354  &   9485.000  &   109.291
>   6  &  13065.300  &   180.608  &   9419.800  &    75.043
>   7  &  12169.600  &    95.422  &   9863.800  &   227.662
>   8  &  12422.200  &   215.535  &  15335.300  &   245.764
>
> * sequential writes
> [0] - cfq, intel_pstate driver, powersave governor
> [1] - bfq_mq, intel_pstate driver, powersave governor
>
> jobs    [0] mean     [0] stddev    [1] mean     [1] stddev
>   1  &  14171.300  &    80.796  &  14392.500  &   182.587
>   2  &  13520.000  &    88.967  &   9565.400  &   119.400
>   3  &  13396.100  &    44.936  &   9284.000  &    25.122
>   4  &  13139.800  &    62.325  &   8846.600  &    45.926
>   5  &  12942.400  &    45.729  &   8568.700  &    35.852
>   6  &  12650.600  &    41.283  &   8275.500  &   199.273
>   7  &  12475.900  &    43.565  &   8252.200  &    33.145
>   8  &  12307.200  &    43.594  &  13617.500  &   127.773
>
> With the performance instead of the powersave governor the results were (expectedly) higher, but the pattern was the same -- bfq-mq shows a "dent" for tests with 2-7 fio jobs. At the moment I have no explanation for this behavior.
>

I have :)

BFQ, by default, is configured to privilege latency over throughput. In this respect, as various people and I happened to discuss a few times, even on these mailing lists, the only way to provide strong low-latency guarantees, at the moment, is through device idling. The throughput loss you see is very likely to be the consequence of that idling.

Why does the throughput go back up at eight jobs? Because, if many processes are born in a very short time interval, then BFQ understands that some multi-job task is being started. And these parallel tasks usually prefer overall high throughput to single-process low latency. Then, BFQ does not idle the device for these processes.

That said, if you do always want maximum throughput, even at the expense of latency, then just switch off the low-latency heuristics, i.e., set low_latency to 0. Depending on the device, setting slice_idle to 0 may help a lot too (as well as with CFQ). If the throughput is still low even after forcing BFQ into an only-throughput mode, then you hit some bug, and I'll have a little more work to do ...

Thanks,
Paolo

> Regards,
> Andreas
bfq-mq performance comparison to cfq
Hi Paolo,

I've looked at your WIP branch as of 4.11.0-bfq-mq-rc4-00155-gbce0818 and did some fio tests to compare the behavior to CFQ.

My understanding is that bfq-mq is supposed to be merged sooner or later, and then it will be the only reasonable I/O scheduler with blk-mq for rotational devices. Hence I think it is interesting to see what to expect performance-wise in comparison to CFQ, which is usually used for such devices with the legacy block layer.

I've just done simple tests iterating over the number of jobs (1-8, as the test system had 8 CPUs) for all (random/sequential) read/write patterns. The fixed set of fio parameters used was '--size=5G --group_reporting --ioengine=libaio --direct=1 --iodepth=1 --runtime=10'.

I've done 10 runs for each such configuration. The device used was an older SAMSUNG HD103SJ 1TB disk, SATA attached. The results that stick out the most are those for sequential reads and sequential writes:

* sequential reads
[0] - cfq, intel_pstate driver, powersave governor
[1] - bfq_mq, intel_pstate driver, powersave governor

jobs    [0] mean     [0] stddev    [1] mean     [1] stddev
  1  &  17060.300  &    77.090  &  17657.500  &    69.602
  2  &  15318.200  &    28.817  &  10678.000  &   279.070
  3  &  15403.200  &    42.762  &   9874.600  &    93.436
  4  &  14521.200  &   624.111  &   9918.700  &   226.425
  5  &  13893.900  &   144.354  &   9485.000  &   109.291
  6  &  13065.300  &   180.608  &   9419.800  &    75.043
  7  &  12169.600  &    95.422  &   9863.800  &   227.662
  8  &  12422.200  &   215.535  &  15335.300  &   245.764

* sequential writes
[0] - cfq, intel_pstate driver, powersave governor
[1] - bfq_mq, intel_pstate driver, powersave governor

jobs    [0] mean     [0] stddev    [1] mean     [1] stddev
  1  &  14171.300  &    80.796  &  14392.500  &   182.587
  2  &  13520.000  &    88.967  &   9565.400  &   119.400
  3  &  13396.100  &    44.936  &   9284.000  &    25.122
  4  &  13139.800  &    62.325  &   8846.600  &    45.926
  5  &  12942.400  &    45.729  &   8568.700  &    35.852
  6  &  12650.600  &    41.283  &   8275.500  &   199.273
  7  &  12475.900  &    43.565  &   8252.200  &    33.145
  8  &  12307.200  &    43.594  &  13617.500  &   127.773

With the performance instead of the powersave governor the results were (expectedly) higher, but the pattern was the same -- bfq-mq shows a "dent" for tests with 2-7 fio jobs. At the moment I have no explanation for this behavior.

Regards,
Andreas
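For readers who want to run a similar matrix, here is a rough sketch of the loop implied by the description above; the target file path, job name, and output naming are assumptions for illustration, not the scripts actually used for these numbers:

    #!/bin/sh
    # Sketch only: iterate over I/O patterns and job counts, 10 runs each,
    # with the fio parameters listed above. TESTFILE is an assumed path on
    # the disk under test.
    TESTFILE=/mnt/test/fio.dat
    for rw in read write randread randwrite; do
        for jobs in 1 2 3 4 5 6 7 8; do
            for run in $(seq 1 10); do
                fio --name=fiotest --filename=$TESTFILE --rw=$rw \
                    --numjobs=$jobs --size=5G --group_reporting \
                    --ioengine=libaio --direct=1 --iodepth=1 --runtime=10 \
                    --output=result-$rw-${jobs}jobs-run$run.txt
            done
        done
    done

The per-run mean and stddev figures in the tables above would then be computed over the 10 runs of each (pattern, job count) combination.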