On Tue, Dec 17, 2024 at 03:30:51PM +0100, Eelco Chaudron wrote:
>
>
> On 13 Dec 2024, at 13:56, Adrián Moreno wrote:
>
> > On Mon, Nov 25, 2024 at 03:14:20PM +0100, Eelco Chaudron wrote:
> >> This patch installs a kernel return probe on ovs_dp_upcall() to record
> >> all successful and failed calls per CPU. The statistics are reported
> >> when the script exits, providing insights into the upcall behavior.
> >>
> >> This is an example output:
> >>
> >> # UPCALL STATISTICS:
> >> Total upcalls : 183247 [0: 8937, 2: 14464, 4: 10372, 6: 4254, ...]
> >> Successfull upcalls : 120195 [0: 5945, 2: 6379, 4: 5966, 6: 4254, ...]
> >> Failed upcalls : 63052 [0: 2992, 2: 8085, 4: 4406, 8: 2275, ...]
> >> 11, EAGAIN : 63052 [0: 2992, 2: 8085, 4: 4406, 8: 2275, ...]
> >>
> >> Signed-off-by: Eelco Chaudron <[email protected]>
> >
> > Thanks for this patch. Correlating upcall failures to kernel delays is
> > extremely useful. Some comments below.
>
> Thanks Adrian for the review. I’ll send out a v2 in a minute, but see my
> responses below.
>
> //Eelco
>
> >> ---
> >>  utilities/automake.mk                   |    1
> >>  utilities/usdt-scripts/kernel_delay.py  |  127 +++++++++++++++++++++++++++++++
> >>  utilities/usdt-scripts/kernel_delay.rst |    9 ++
> >>  3 files changed, 136 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/utilities/automake.mk b/utilities/automake.mk
> >> index acc1af4e0..22260b254 100644
> >> --- a/utilities/automake.mk
> >> +++ b/utilities/automake.mk
> >> @@ -149,6 +149,7 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \
> >> utilities/ovs-pipegen.py \
> >> utilities/usdt-scripts/dpif_op_nl_monitor.py \
> >> utilities/usdt-scripts/flow_reval_monitor.py \
> >> + utilities/usdt-scripts/kernel_delay.py \
> >> utilities/usdt-scripts/upcall_monitor.py \
> >> utilities/usdt-scripts/upcall_cost.py
> >>
> >> diff --git a/utilities/usdt-scripts/kernel_delay.py b/utilities/usdt-scripts/kernel_delay.py
> >> index de6b0c9de..a5aa1a3d4 100755
> >> --- a/utilities/usdt-scripts/kernel_delay.py
> >> +++ b/utilities/usdt-scripts/kernel_delay.py
> >> @@ -39,6 +39,7 @@
> >> #
> >> import argparse
> >> import datetime
> >> +import errno
> >> import os
> >> import pytz
> >> import psutil
> >> @@ -556,6 +557,30 @@ TRACEPOINT_PROBE(irq, softirq_exit)
> >> data->start_ns = 0;
> >> return 0;
> >> }
> >> +
> >> +
> >> +/*
> >> + * For measuring upcall statistics (per CPU).
> >> + */
> >> +#define MAX_ERROR 255
> >
> > Why this limit? The kernel seems to use 4095 in the PTR_ERR
> > infrastructure. Although not exposed to user applications, the
> > kernel has some errno values above 255. They don't seem networking
> > related for now, but still.
> >
> > In order to remove this arbitrary limit and avoid needless iteration
> > over an almost empty array, I'd make it a percpu_hash. It supports
> > "lookup_or_try_init" and "increment", so it's as easy to use from eBPF
> > and almost identical from Python.
>
> Good point, I will use a per-CPU hash table instead.
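As a rough illustration of what the per-CPU hash makes easy on the Python side, here is a standalone sketch; the dict below merely stands in for an iterated BCC percpu_hash, and all names and numbers are made up:

```python
# Stand-in for iterating a BCC percpu_hash: each key (an errno) maps to
# a list with one counter per CPU. Data and names are illustrative only.

def aggregate_upcalls(percpu_hash):
    """Sum per-CPU counters: returns (grand_total, {errno: total})."""
    per_errno = {err: sum(cpu_counts)
                 for err, cpu_counts in percpu_hash.items()}
    return sum(per_errno.values()), per_errno

# errno 0 = success, errno 11 = EAGAIN; two CPUs in this toy example.
total, per_errno = aggregate_upcalls({0: [5945, 6379], 11: [2992, 8085]})
```

Only keys that were actually hit exist in the hash, so there is no near-empty 256-slot array to walk.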
>
> >> +BPF_TABLE("percpu_array", uint32_t, uint64_t, upcall_count, MAX_ERROR + 1);
> >> +
> >> +#if <INSTALL_OVS_DP_UPCALL_PROBE>
> >> +int kretprobe__ovs_dp_upcall(struct pt_regs *ctx)
> >> +{
> >
> > When trying the script I was a bit confused that numbers didn't add up,
> > e.g:
> >
> > # UPCALL STATISTICS:
> >   Total upcalls       : 776234 [0: 204345, 1: 38773, 2: 38206, 3: 38700, 4: 104040, 5: 77400, 6: 70290, 8: 69705, 9: 67410, 10: 14220, 11: 53145]
> >   Successfull upcalls : 268640 [0: 72829, 1: 14556, 2: 12876, 3: 13192, 4: 34227, 5: 26480, 6: 24130, 8: 23791, 9: 23372, 10: 6164, 11: 17023]
> >   Failed upcalls      : 508564 [0: 131516, 1: 25187, 2: 25330, 3: 25508, 4: 69813, 5: 50920, 6: 46160, 8: 45914, 9: 44038, 10: 8056, 11: 36122]
> >   11, EAGAIN          : 509485 [0: 131516, 1: 26112, 2: 25330, 3: 25508, 4: 69813, 5: 50920, 6: 46160, 8: 45914, 9: 44038, 10: 8056, 11: 36122]
> >
> > Total EAGAIN errors != total Failed, and
> > Total Failed + Total Successfull != Total.
> >
> > I believe the reason is that events of this kind are still being
> > triggered while results are being processed. The problem seems to go
> > away if I add the following lines at the top of this probe:
> >
> > if (!capture_enabled__())
> > return 0;
>
> ACK, I will fix this, including resetting the values between runs…
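In plain Python, the gating plus the reset between runs could be modelled like this toy sketch (the class and method names are illustrative, not the script's actual API):

```python
class UpcallCounter:
    """Toy model of the probe-side gating the review asks for."""

    def __init__(self):
        self.enabled = False
        self.counts = {}

    def start(self):
        # Reset between runs so stale counts don't leak into new results.
        self.counts.clear()
        self.enabled = True

    def stop(self):
        self.enabled = False

    def record(self, err):
        # Mirrors "if (!capture_enabled__()) return 0;" in the probe.
        if not self.enabled:
            return
        self.counts[err] = self.counts.get(err, 0) + 1

c = UpcallCounter()
c.record(11)   # ignored: capture not enabled yet
c.start()
c.record(11)
c.record(0)
c.stop()
c.record(0)    # ignored: fired while results are being printed
```

With the guard in place, events raised while userspace is reading the tables can no longer make the totals disagree.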
>
> >> + int ret = PT_REGS_RC(ctx);
> >> +
> >> + if (ret > 0)
> >> + ret = 0;
> >> +
> >> + ret = -ret;
> >> + if (ret > MAX_ERROR)
> >> + ret = MAX_ERROR;
> >> +
> >> + upcall_count.increment(ret);
> >> + return 0;
> >> +}
> >> +#endif /* For measuring upcall statistics/errors. */
> >> """
> >>
> >>
> >> @@ -900,6 +925,95 @@ def print_timestamp(msg):
> >> print(time_string)
> >>
> >>
> >> +#
> >> +# Get errno short string
> >> +#
> >> +def get_errno_short(err):
> >> + try:
> >> + return errno.errorcode[err]
> >> + except KeyError:
> >> + return "_unknown_"
> >> +
> >> +
> >> +#
> >> +# Format a bfd per cpu array entry per cpu (if the count is > 0)
> >> +#
> >
> > s/bfd/bpf?
> > Also, "per cpu" twice makes the sentence a bit difficult to read.
>
> Ack
>
> >
> >> +def format_per_cpu_array(cpu_array, key=None, skip_key=None):
> >> + per_cpu = ""
> >> +
> >> + if key is not None:
> >
> > "is not None" is redundant: "if key:" is equivalent.
>
> This and the one below are not the same for my code; ‘if key:’ behaves
> incorrectly:
>
Ugh, you're right. If key is zero the check fails.
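A standalone snippet makes the difference concrete; with `if key:`, the success bucket (errno 0) falls through to the all-keys branch:

```python
def describe(key=None):
    # Buggy variant: treats key 0 the same as "no key given".
    if key:
        return "single-key"
    return "all-keys"

def describe_fixed(key=None):
    # Correct variant: only None means "no key given".
    if key is not None:
        return "single-key"
    return "all-keys"
```

`describe(0)` returns "all-keys" while `describe_fixed(0)` returns "single-key", which is exactly why every line above ended up showing the full-table totals.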
> # UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...]):
>   Total upcalls       : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 8930,...]
>   Successfull upcalls : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 8930,...]
>   Failed upcalls      : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 8930,...]
>   11, EAGAIN          : 59526 [0: 7701, 2: 7580, 4: 4443, 6: 2745, 8: 3013,...]
>
> vs ‘if key is not None’:
>
> # UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...]):
>   Total upcalls       : 209023 [0: 16271, 2: 14830, 4: 11787, 6: 10025, 8: 4875,...]
>   Successfull upcalls : 127677 [0: 6342, 2: 6250, 4: 5547, 6: 5464, 8: 4875,...]
>   Failed upcalls      : 81346 [0: 9929, 2: 8580, 4: 6240, 6: 4561, 10: 4602,...]
>   11, EAGAIN          : 81346 [0: 9929, 2: 8580, 4: 6240, 6: 4561, 10: 4602,...]
>
>
> >> + total = cpu_array.sum(key).value
> >> + if total > 0:
> >> + for cpu, value in enumerate(cpu_array.getvalue(key)):
> >> + if value == 0:
> >> + continue
> >> +
> >> + per_cpu += " {}: {},".format(cpu, value)
> >> + else:
> >> + total = 0
> >> + total_cpu = None
> >> +
> >> + for key in cpu_array.keys():
> >
> > See below.
> >
> >> +        # I've seen extrem odd hight values, to avoid those, exclude any
> >
> > typo? "extremely odd high values"?
> >
> >> + # index higher thant 1024 for now.
> >
> > What values were you getting? I'm not seeing any weird values.
> > Besides, isn't the maximum set on the eBPF side to 255?
> > Plus, it's an array, so its size is limited and no key can exceed
> > 255, right?
>
> This was a while back; I can't remember if it was a specific kernel or
> BCC version. I would get indexes that looked like integer overflows.
> However, with the hash implementation this is no longer required.
>
> >> +        if (not (0 <= key.value <= 1024)) or (skip_key is not None and
> >> +                                              skip_key.value == key.value):
> >
> > ditto: key.value in range(0, 1024)
> > s/skip_key is not None/skip_key/?
> >
> > OTOH, from include/linux/err.h:
> > #define MAX_ERRNO 4095
> >
> > Maybe that value makes more sense although I really don't think we need
> > a check here.
>
> See above
> >> + continue
> >
> >> +
> >> + if total_cpu is None:
> >
> > if not total_cpu
> >> + total_cpu = [0] * len(cpu_array.getvalue(key))
> >> +
> >> + for cpu, value in enumerate(cpu_array.getvalue(key)):
> >> + total_cpu[cpu] += value
> >> + total += value
> >> +
> >> + if total >= 0:
> >> + for cpu, value in enumerate(total_cpu):
> >> + if value == 0:
> >> + continue
> >> +
> >> + per_cpu += " {}: {},".format(cpu, value)
> >> +
> >> + return str(total), per_cpu.strip(", ")
> >> +
> >> +
> >> +#
> >> +# Display kernel upcall statistics
> >> +#
> >> +def display_upcall_results():
> >> + upcalls = bpf.get_table("upcall_count")
> >
> > From bcc's reference: get_table()
> > Returns a table object. This is no longer used, as tables can now be
> > read as items from BPF. Eg: ```BPF[name]```.
> >
> > Not sure if this means the API will be deprecated...
>
> Changed it; let's not wait for the deprecation notice ;)
>
> >> + have_upcalls = False
> >> +
> >> + for k in upcalls.keys():
> >
> > Interestingly, "keys()" is not documented as an API. Nevertheless, it's
> > not necessary, since "table" iterates over keys (same as list or dict in
> > python). So it could be written as "for k in upcalls:".
>
> Good catch; changed here and below.
>
> >> + if not (0 <= k.value <= 255) or upcalls.sum(k).value == 0:
> >
> > I think the pythonic way is "if k.value not in range(0, 255)". Still,
> > how would it be possible to read keys greater than 255 if it's being
> > limited at the eBPF side?
>
> See above.
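One nit on the suggested rewrite itself: range() excludes its stop value, so `k.value not in range(0, 255)` would drop key 255; matching `0 <= k.value <= 255` needs `range(256)`. A quick standalone check:

```python
# range() excludes its stop value, so the two checks differ at 255.
keys = [0, 254, 255, 256]
inclusive = [k for k in keys if 0 <= k <= 255]
via_range = [k for k in keys if k in range(256)]       # equivalent
off_by_one = [k for k in keys if k in range(0, 255)]   # drops 255
```

Membership tests on a range of ints are O(1), so either spelling is cheap; only the bound needs care.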
>
> >> + continue
> >> + have_upcalls = True
> >> + break
> >> +
> >> + if not have_upcalls:
> >> + return
> >> +
> >> +    print("\n\n# UPCALL STATISTICS:\n  Total upcalls       : {} [{}]".format(
> >> +        *format_per_cpu_array(upcalls)))
> >> +
> >
> > When looking at these statistics, users may take a while to figure out
> > what each number means, e.g.:
> >
> > # UPCALL STATISTICS:
> >   Total upcalls       : 240933 [0: 33165, 2: 27495, 3: 18270, 4: 15300, 5: 33885, 6: 4050, 8: 51348, 9: 33075, 10: 24345]
> >   Successfull upcalls : 80538 [0: 11804, 2: 8382, 3: 5660, 4: 5716, 5: 11606, 6: 2845, 8: 13717, 9: 10333, 10: 10475]
> >   Failed upcalls      : 161587 [0: 21361, 2: 19113, 3: 12610, 4: 9584, 5: 22279, 6: 2397, 8: 37631, 9: 22742, 10: 13870]
> >   11, EAGAIN          : 162528 [0: 21361, 2: 19113, 3: 12610, 4: 9584, 5: 22279, 6: 3345, 8: 37631, 9: 22742, 10: 1387
> >
> > An option could be to describe the format in the header, e.g:
> >
> > # UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...])
> >   Total upcalls       : 240933 [0: 33165, 2: 27495, 3: 18270, 4: 15300, 5: 33885, 6: 4050, 8: 51348, 9: 33075, 10: 24345]
> >   Successfull upcalls : 80538 [0: 11804, 2: 8382, 3: 5660, 4: 5716, 5: 11606, 6: 2845, 8: 13717, 9: 10333, 10: 10475]
> >   Failed upcalls      : 161587 [0: 21361, 2: 19113, 3: 12610, 4: 9584, 5: 22279, 6: 2397, 8: 37631, 9: 22742, 10: 13870]
> >   11, EAGAIN          : 162528 [0: 21361, 2: 19113, 3: 12610, 4: 9584, 5: 22279, 6: 3345, 8: 37631, 9: 22742, 10: 1387
>
> I went for the above, as the CPU set might not be the same in all lines, and
> I would like to condense this as much as possible (I was testing on 64 CPUs
> ;).
>
> > Another option could be to print them in a table:
> >
> >   Result          TOTAL     CPU 0   CPU 2   CPU 3   CPU 4   CPU 5   CPU 6   CPU 8   CPU 9   CPU 10
> >   --------------  --------  ------  ------  ------  ------  ------  ------  ------  ------  ------
> >   TOTAL           240933    33165   27495   18270   15300   33885   4050    51348   33075   24345
> >   Successful      80538     11804   8382    5660    5716    11606   2845    13717   10333   10475
> >   Failed          161587    21361   19113   12610   9584    22279   2397    37631   22742   13870
> >   11, EAGAIN      162528    21361   19113   12610   9584    22279   3345    37631   22742   1387
> >
> >
> >
> >
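If the table layout is ever revisited, it could be produced with simple format strings; the helper and data below are hypothetical, not part of the patch:

```python
def format_stats_table(rows, cpus):
    """Render per-CPU upcall stats as an aligned plain-text table."""
    header = "{:<14} {:>9}".format("Result", "TOTAL")
    header += "".join(" {:>8}".format("CPU " + str(c)) for c in cpus)
    lines = [header, "-" * len(header)]
    for name, per_cpu in rows:
        line = "{:<14} {:>9}".format(name, sum(per_cpu))
        line += "".join(" {:>8}".format(v) for v in per_cpu)
        lines.append(line)
    return "\n".join(lines)

# Toy data for two CPUs; real rows would come from the per-CPU hash.
table = format_stats_table(
    [("TOTAL", [33165, 27495]), ("Successful", [11804, 8382])],
    cpus=[0, 2])
```

The fixed-width columns avoid the per-line CPU-set ambiguity, at the cost of very wide output on many-CPU machines.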
> >> + for k in upcalls.keys():
> >
> > nit: Same as above, "for k in upcalls:"
>
> Ack
>
> >> + error = k.value
> >> + total, per_cpu = format_per_cpu_array(upcalls, key=k)
> >> +
> >> + if error != 0 and total == "0":
> >
> > Maybe return the total as integer to avoid this strange looking '== "0"'?
>
> Ack, yes, I was too much C-minded here, always returning a string…
>
> >> + continue
> >> +
> >> +        if error == 0:
> >> +            print("  Successfull upcalls : {} [{}]".format(total,
> >> +                                                           per_cpu))
> >> +
> >> +            total, per_cpu = format_per_cpu_array(upcalls, skip_key=k)
> >> +            print("  Failed upcalls      : {} [{}]".format(total,
> >> +                                                           per_cpu))
> >> +        else:
> >> +            print("  {:3}, {:13}: {} [{}]".format(error,
> >> +                                                  get_errno_short(error),
> >> +                                                  total, per_cpu))
> >> +
> >> +
> >> #
> >> # process_results()
> >> #
> >> @@ -1074,7 +1188,12 @@ def process_results(syscall_events=None, trigger_delta=None):
> >> indent, "TOTAL:", "", total_count, total_ns))
> >>
> >> #
> >> - # Print events
> >> + # Print upcall statistics
> >> + #
> >> + display_upcall_results()
> >> +
> >> + #
> >> + # Print syscall events
> >> #
> >> lost_stack_traces = 0
> >> if syscall_events:
> >> @@ -1194,6 +1313,9 @@ def main():
> >>      parser.add_argument("--skip-syscall-poll-events",
> >>                          help="Skip poll() syscalls with --syscall-events",
> >>                          action="store_true")
> >> +    parser.add_argument("--skip-upcall-stats",
> >> +                        help="Skip the collection of upcall statistics",
> >> +                        action="store_true")
> >> parser.add_argument("--stack-trace-size",
> >> help="Number of unique stack traces that can be "
> >> "recorded, default 4096. 0 to disable",
> >> @@ -1298,6 +1420,9 @@ def main():
> >>      source = source.replace("<STACK_TRACE_ENABLED>", "true"
> >>                              if options.stack_trace_size > 0 else "false")
> >>
> >> +    source = source.replace("<INSTALL_OVS_DP_UPCALL_PROBE>", "0"
> >> +                            if options.skip_upcall_stats else "1")
> >> +
> >> #
> >> # Handle start/stop probes
> >> #
> >> diff --git a/utilities/usdt-scripts/kernel_delay.rst b/utilities/usdt-scripts/kernel_delay.rst
> >> index 0f6f916a7..41620d760 100644
> >> --- a/utilities/usdt-scripts/kernel_delay.rst
> >> +++ b/utilities/usdt-scripts/kernel_delay.rst
> >> @@ -109,6 +109,7 @@ followed by resource-specific data. Which are:
> >> - ``THREAD STOPPED STATISTICS``
> >> - ``HARD IRQ STATISTICS``
> >> - ``SOFT IRQ STATISTICS``
> >> +- ``UPCALL STATISTICS``
> >>
> >> The following sections will describe in detail what statistics they
> >> report.
> >>
> >> @@ -187,6 +188,14 @@ number of interrupts (``COUNT``), the total time spent in the interrupt
> >>  handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
> >>
> >>
> >> +``UPCALL STATISTICS``
> >> +~~~~~~~~~~~~~~~~~~~~~
> >> +The ``UPCALL STATISTICS`` section provides information on the number of
> >> +upcalls sent by the kernel to userspace. If any upcalls fail to be sent,
> >> +the specific return codes are recorded. Statistics are presented both as
> >> +a total and on a per-CPU basis.
> >> +
> >> +
> >> The ``--syscall-events`` option
> >> -------------------------------
> >> In addition to reporting global syscall statistics in ``SYSCALL_STATISTICS``,
> >>
> >> _______________________________________________
> >> dev mailing list
> >> [email protected]
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>
>