Adding Ulf, Linus and all the others again, because Tejun replied to my initial
email, which did not yet include them as recipients.

> Il giorno 18 ott 2017, alle ore 17:02, Paolo Valente 
> <paolo.vale...@linaro.org> ha scritto:
> 
>> 
>> Il giorno 18 ott 2017, alle ore 15:19, Tejun Heo <t...@kernel.org> ha 
>> scritto:
>> 
>> Hello, Paolo.
>> 
>> On Tue, Oct 17, 2017 at 12:11:01PM +0200, Paolo Valente wrote:
>> ...
>>> protected by a per-device scheduler lock.  To give you an idea, on an
>>> Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on
>>> null_blk (configured with 0 latency), if the update of group stats is
>>> removed, then the throughput grows from 260 to 404 KIOPS.  This and
>>> all the other results we might share in this thread can be reproduced
>>> very easily with a (useful) script made by Luca Miccio [1].
>> 
>> I don't think the old request_queue is ever built for multiple CPUs
>> hitting on a mem-backed device.
>> 
>>> We tried to understand the reason for this high overhead and, in
>>> particular, to find out whether there was some issue that we could
>>> address on our own.  But the causes seem to be somewhat fundamental:
>>> for example, one of the most time-consuming operations invoked by
>>> some blkg_*stats_* functions is find_next_bit, for which we see no
>>> trivial replacement.
>> 
>> Can you point to the specific ones?  I can't find find_next_bit usages
>> in generic blkg code.
>> 
> 
> Yes, sorry for being too generic in the first place (I was afraid of
> writing too much).
> 
> I have attached a flame graph (made by Luca), showing all involved
> functions.  Look, for example, at the blkg_*stat_* functions invoked
> indirectly by bfq_dispatch_request inside any of the worker
> processes.  As I already wrote, find_next_bit seems to be only part of
> the cost of these functions (although an important part).
> 
> You can obtain/reproduce the information in the flame graph (on an
> 8-logical-core CPU) by invoking
> 
> perf record -g -a --call-graph dwarf -F 999
> 
> and, in parallel,
> 
> sudo ./IO_sched-speedtest.sh 20 8 bfq randread
> 
> where IO_sched-speedtest.sh is the script I mentioned in my previous
> email [1].
> 
> [1] https://github.com/Algodev-github/IOSpeed
> 
>>> So, as a first attempt to reduce this severe slowdown, we have made a
>>> patch that moves the invocation of blkg_*stats_* functions outside the
>>> critical sections protected by the bfq lock.  Still, these functions
>>> apparently need to be protected with the request_queue lock, because
>> 
>> blkgs are already protected with RCU, so RCU protection should be
>> enough.
>> 
> 
> blkgs are, but the blkg_stat objects passed to the blkg_*stat_*
> functions by bfq are not.  In particular, these objects are contained
> in bfq_group objects.  Anyway, as I wrote, the cost of using the
> request_queue lock seems to be a loss of about 5% of the throughput.
> So, I guess that replacing this lock with RCU protection would
> probably reduce this loss to only 2% or 3%.  I wonder whether such a
> gain would be worth the additional conceptual complexity of RCU, at
> least until the major problem, i.e., the apparently high cost of the
> stat updates, is solved somehow.
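> 
> Just to make the pattern concrete, here is a minimal, user-space
> sketch of what I mean (compile with -pthread; all names are invented
> for illustration, this is not the actual bfq/blk-cgroup code): while
> the per-device scheduler lock is held, we only record the deltas to
> apply; the shared group counters are touched only after that lock has
> been released, under the separate, coarser lock.
> 
> #include <pthread.h>
> #include <stdint.h>
> 
> struct stats_delta {
>         int64_t serviced;  /* recorded under sched_lock, applied later */
> };
> 
> /* stands in for the per-device scheduler (bfq) lock */
> static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
> /* stands in for the request_queue lock */
> static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
> 
> static int64_t grp_serviced;  /* shared per-group counter */
> 
> static void dispatch_one_request(struct stats_delta *d)
> {
>         pthread_mutex_lock(&sched_lock);
>         /* ... scheduling decision would happen here ... */
>         d->serviced++;          /* only remember what to add later */
>         pthread_mutex_unlock(&sched_lock);
> }
> 
> static void flush_stats(const struct stats_delta *d)
> {
>         /* the actual stat update happens outside the hot scheduler lock */
>         pthread_mutex_lock(&queue_lock);
>         grp_serviced += d->serviced;
>         pthread_mutex_unlock(&queue_lock);
> }
> 
> int main(void)
> {
>         struct stats_delta d = { 0 };
> 
>         dispatch_one_request(&d);
>         flush_stats(&d);
>         return 0;
> }
> 
> Replacing queue_lock with RCU protection, as you suggest, would change
> only the flush_stats() side of this sketch; the point here is just
> where the boundary between the two critical sections sits.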
> 
> Thanks,
> Paolo
> 
>> Thanks.
>> 
>> -- 
>> tejun
> 
> <bfq-tracing-cgroup.svg>
