Hi Marcus,
Thanks for the detailed answer.
It took a while, but I managed to work on this (in several steps) over the
last couple of months.

I found the bottleneck using ControlPort. It pointed to a couple of blocks
as bottlenecks (judging by the output buffers of the blocks preceding them).
Interestingly, they weren't IO-bound. I haven't investigated further, but
consider that these blocks (i) caused dropped samples, (ii) were CPU-bound,
and (iii) never came anywhere near 100% in htop. My guess is that their CPU
usage fluctuates strongly: they may hit 100% CPU for very short periods
(which is when samples get dropped), shorter than the sampling and
averaging periods of tools like htop.
Optimizing those blocks solved the dropped-samples issue.
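
To illustrate that guess, here is a minimal sketch with made-up numbers
(nothing here was measured on the actual flowgraph). It assumes a block
that saturates a core for 20 ms out of every 100 ms and sits at 10%
otherwise, and a monitoring tool that averages over 1-second windows. The
burst length, idle level, window size, and the helper name
`windowed_average` are all assumptions chosen for illustration.

```python
# Hypothetical numbers for illustration only; not measured from the flowgraph.

def windowed_average(samples, window):
    """Average utilization over consecutive, non-overlapping windows,
    roughly what a monitoring tool with a fixed refresh interval shows."""
    return [
        sum(samples[i:i + window]) / window
        for i in range(0, len(samples) - window + 1, window)
    ]

# One sample per millisecond. In every 100 ms period the block is pinned
# at 100% for 20 ms (assumed burst) and at 10% for the remaining 80 ms.
TICKS_PER_PERIOD = 100
BURST_TICKS = 20
period = [100.0] * BURST_TICKS + [10.0] * (TICKS_PER_PERIOD - BURST_TICKS)
trace = period * 30  # 3 seconds of 1 ms samples

peak = max(trace)                              # what the scheduler experiences
one_second_view = windowed_average(trace, 1000)  # what a 1 s average reports

print("peak utilization:", peak)
print("1 s averages:", one_second_view)
```

With these assumed numbers, every 1-second average comes out at 28%, even
though the block is fully saturated during each burst, which would be
enough to stall upstream buffers for that long.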

I only gave kernelshark a little attention. It didn't prove to be a simple
tool to use out of the box, and I don't think the docs and level of polish
are good enough for this general use case. That might change if someone
takes it on as a project and documents how to apply it to flowgraph
inspection.

On Tue, Nov 7, 2017 at 1:25 PM Marcus Müller <muel...@kit.edu> wrote:

> Hi Gilad,
> part of this is for the future reader of this thread, so, please, bear
> with me:
> On 07.11.2017 10:42, Gilad Beeri (ApolloShield) wrote:
> > I have a flowgraph, that when run, no CPU core is ever close to 100%
> > utilization.
> Indeed, dropped samples indicate a bottleneck narrower than your USRP's
> sampling rate, but that bottleneck doesn't have to be CPU overutilization!
> Simplest example: take a flow graph that otherwise wouldn't produce any
> problems and add a Throttle block set to half the necessary sampling rate.
> Most often, I find that IO operations actually become the bottleneck
> – be it that sending samples to the USRP (or receiving them) is actually
> pretty time-intense, or that you need to interact with storage.
> Depending on the tooling you choose, this fact might or might not be
> hidden; time spent, for example "on behalf" of a thread in Kernel land,
> searching for a contiguous piece of memory to give to that process, or
> handling USB buffers or... might or might not be attributed to the process.
> Another very classical problem is memory bandwidth and latency; so, as
> shown by SE at this year's GRCon, chances aren't that bad that you can
> optimize quite a bit if you co-locate connected blocks on the same CPU:
> you get a caching advantage (or, rather, don't incur a disadvantage).
> That all being said, how do you proceed?
> First of all, this is one of the cases where having ControlPort is very
> helpful. If you have it (with Thrift and PerfCounters enabled), you can
> start the CtrlPort Performance Monitor, and see which output buffers
> "stay full" all the time. The block after such a buffer is probably your
> bottleneck.
> If you don't, try running `perf top -ag` (as root might help here, you
> want to also inspect kernel times, not quite sure about that, though).
> You should be getting a listing of "when we sampled where the CPU(s)
> were, in x % of the time, they were stuck in these functions".
> I really tried, but haven't had the time to work with kernelshark. That
> might really be a tool of choice here. In fact, it looks so cool that I
> could imagine that we one day supersede the perf counter concept with
> that; who knows.
> If you do happen to look into that, I'd be very happy to get some
> feedback about the process, and what the problems were. I think this is
> definitely something we want to enable users to do – understand not only
> the behaviour of their blocks in isolation, but how a system works.
> After all, one of the major "let's dream about a GNU Radio in the
> future" things we're considering is making it easy to distribute a flow
> graph across computers, and for that, systemic insight pretty much is a
> must.
> Best regards,
> Marcus
> _______________________________________________
> Discuss-gnuradio mailing list
> Discuss-gnuradio@gnu.org
> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio