On 13 Dec 2024, at 13:56, Adrián Moreno wrote:

> On Mon, Nov 25, 2024 at 03:14:20PM +0100, Eelco Chaudron wrote:
>> This patch installs a kernel return probe on ovs_dp_upcall() to record
>> all successful and failed calls per CPU. The statistics are reported
>> when the script exits, providing insights into the upcall behavior.
>>
>> This is an example output:
>>
>>   # UPCALL STATISTICS:
>>     Total upcalls       : 183247 [0: 8937, 2: 14464, 4: 10372, 6: 4254, ...]
>>     Successfull upcalls : 120195 [0: 5945, 2: 6379, 4: 5966, 6: 4254, ...]
>>     Failed upcalls      : 63052 [0: 2992, 2: 8085, 4: 4406, 8: 2275, ...]
>>        11, EAGAIN       : 63052 [0: 2992, 2: 8085, 4: 4406, 8: 2275, ...]
>>
>> Signed-off-by: Eelco Chaudron <[email protected]>
>
> Thanks for this patch. Correlating upcall failures to kernel delays is
> extremely useful. Some comments below.

Thanks Adrian for the review. I’ll send out a v2 in a minute, but see my 
responses below.

//Eelco

>> ---
>>  utilities/automake.mk                   |    1
>>  utilities/usdt-scripts/kernel_delay.py  |  127 
>> +++++++++++++++++++++++++++++++
>>  utilities/usdt-scripts/kernel_delay.rst |    9 ++
>>  3 files changed, 136 insertions(+), 1 deletion(-)
>>
>> diff --git a/utilities/automake.mk b/utilities/automake.mk
>> index acc1af4e0..22260b254 100644
>> --- a/utilities/automake.mk
>> +++ b/utilities/automake.mk
>> @@ -149,6 +149,7 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \
>>      utilities/ovs-pipegen.py \
>>      utilities/usdt-scripts/dpif_op_nl_monitor.py \
>>      utilities/usdt-scripts/flow_reval_monitor.py \
>> +    utilities/usdt-scripts/kernel_delay.py \
>>      utilities/usdt-scripts/upcall_monitor.py \
>>      utilities/usdt-scripts/upcall_cost.py
>>
>> diff --git a/utilities/usdt-scripts/kernel_delay.py 
>> b/utilities/usdt-scripts/kernel_delay.py
>> index de6b0c9de..a5aa1a3d4 100755
>> --- a/utilities/usdt-scripts/kernel_delay.py
>> +++ b/utilities/usdt-scripts/kernel_delay.py
>> @@ -39,6 +39,7 @@
>>  #
>>  import argparse
>>  import datetime
>> +import errno
>>  import os
>>  import pytz
>>  import psutil
>> @@ -556,6 +557,30 @@ TRACEPOINT_PROBE(irq, softirq_exit)
>>      data->start_ns = 0;
>>      return 0;
>>  }
>> +
>> +
>> +/*
>> + * For measuring upcall statistics (per CPU).
>> + */
>> +#define MAX_ERROR 255
>
> Why this limit? The kernel seems to use 4095 in the PTR_ERR
> infrastructure. Although not exposed to user applications,
> kernel has some > 255 values. They don't seem networking related for
> now but still.
>
> In order to reduce this arbitrary limit and avoid needless iteration
> over an almost empty array, I'd make it a percpu_hash. It supports
> "lookup_or_try_init" and "increment" so it's as easy to use from ebpf
> and almost identically from python.

Good point, I will use a per-CPU hash table instead.

>> +BPF_TABLE("percpu_array", uint32_t, uint64_t, upcall_count, MAX_ERROR + 1);
>> +
>> +#if <INSTALL_OVS_DP_UPCALL_PROBE>
>> +int kretprobe__ovs_dp_upcall(struct pt_regs *ctx)
>> +{
>
> When trying the script I was a bit confused that numbers didn't add up,
> e.g:
>
> # UPCALL STATISTICS:
>   Total upcalls       : 776234 [0: 204345, 1: 38773, 2: 38206, 3:
> 38700, 4: 104040, 5: 77400, 6: 70290, 8: 69705, 9: 67410, 10: 14220,
> 11: 53145]
>   Successfull upcalls : 268640 [0: 72829, 1: 14556, 2: 12876, 3:
> 13192, 4: 34227, 5: 26480, 6: 24130, 8: 23791, 9: 23372, 10: 6164, 11:
> 17023]
>   Failed upcalls      : 508564 [0: 131516, 1: 25187, 2: 25330, 3:
> 25508, 4: 69813, 5: 50920, 6: 46160, 8: 45914, 9: 44038, 10: 8056, 11:
> 36122]
>      11, EAGAIN       : 509485 [0: 131516, 1: 26112, 2: 25330, 3:
> 25508, 4: 69813, 5: 50920, 6: 46160, 8: 45914, 9: 44038, 10: 8056, 11:
> 36122]
>
> Total of EAGAIN erros != total Failed and
> Total Failed + Total Successfull != Total.
>
> I believe the reason is that events of this kind are still being
> triggered while results are being processed. The problem seems to go
> away if I add the following lines at the top of this probe:
>
> if (!capture_enabled__())
>     return 0;

ACK, I will fix this, including resetting the values between runs…

>> +    int ret = PT_REGS_RC(ctx);
>> +
>> +    if (ret > 0)
>> +        ret = 0;
>> +
>> +    ret = -ret;
>> +    if (ret > MAX_ERROR)
>> +        ret = MAX_ERROR;
>> +
>> +    upcall_count.increment(ret);
>> +    return 0;
>> +}
>> +#endif /* For measuring upcall statistics/errors. */
>>  """
>>
>>
>> @@ -900,6 +925,95 @@ def print_timestamp(msg):
>>      print(time_string)
>>
>>
>> +#
>> +# Get errno short string
>> +#
>> +def get_errno_short(err):
>> +    try:
>> +        return errno.errorcode[err]
>> +    except KeyError:
>> +        return "_unknown_"
>> +
>> +
>> +#
>> +# Format a bfd per cpu array entry per cpu (if the count is > 0)
>> +#
>
> s/bfd/bpf?
> Also, "per cpu" twice makes the sentense a bit difficult to read.

Ack

>
>> +def format_per_cpu_array(cpu_array, key=None, skip_key=None):
>> +    per_cpu = ""
>> +
>> +    if key is not None:
>
> "is not None" is redundant: "if key:" is equivalent.

This and the one below are not the same for my code; ‘if key:’ behaves 
incorrectly:

# UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...]):
  Total upcalls       : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 
8930,...]
  Successfull upcalls : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 
8930,...]
  Failed upcalls      : 182611 [0: 14506, 2: 13830, 4: 10348, 6: 8752, 8: 
8930,...]
     11, EAGAIN       : 59526 [0: 7701, 2: 7580, 4: 4443, 6: 2745, 8: 3013,...]

vs ‘if key is not None’:

# UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...]):
  Total upcalls       : 209023 [0: 16271, 2: 14830, 4: 11787, 6: 10025, 8: 
4875,...]
  Successfull upcalls : 127677 [0: 6342, 2: 6250, 4: 5547, 6: 5464, 8: 4875,...]
  Failed upcalls      : 81346 [0: 9929, 2: 8580, 4: 6240, 6: 4561, 10: 4602,...]
     11, EAGAIN       : 81346 [0: 9929, 2: 8580, 4: 6240, 6: 4561, 10: 4602,...]


>> +        total = cpu_array.sum(key).value
>> +        if total > 0:
>> +            for cpu, value in enumerate(cpu_array.getvalue(key)):
>> +                if value == 0:
>> +                    continue
>> +
>> +                per_cpu += " {}: {},".format(cpu, value)
>> +    else:
>> +        total = 0
>> +        total_cpu = None
>> +
>> +        for key in cpu_array.keys():
>
> See below.
>
>> +            # I've seen extrem odd hight values, to avoid those, exclude any
>
> typo? "extremely odd high values"?
>
>> +            # index higher thant 1024 for now.
>
> What values you were getting? I'm not seing any weird value.
> Besides, isn't the maximum set in the ebpf side to 255?
> Plus, it's an array so its size is limitted and no key can exceed 256
> right?

This was a while back, I cant remember if it was a specific kernel or BCC 
version.
I would get indexes that look like integer overlows. However with the hash 
implementation this is not longer requered.

>> +            if (not (0 <= key.value <= 1024)) or (skip_key is not None and
>> +                                                  skip_key.value == 
>> key.value):
>
> ditto: key.value in range(0, 1024)
> s/skip_key is not None/skip_key/?
>
> OTOH, from include/linux/err.h:
> #define MAX_ERRNO     4095
>
> Maybe that value makes more sense although I really don't think we need
> a check here.

See above
>> +                continue
>
>> +
>> +            if total_cpu is None:
>
> if not total_cpu
>> +                total_cpu = [0] * len(cpu_array.getvalue(key))
>> +
>> +            for cpu, value in enumerate(cpu_array.getvalue(key)):
>> +                total_cpu[cpu] += value
>> +                total += value
>> +
>> +        if total >= 0:
>> +            for cpu, value in enumerate(total_cpu):
>> +                if value == 0:
>> +                    continue
>> +
>> +                per_cpu += " {}: {},".format(cpu, value)
>> +
>> +    return str(total), per_cpu.strip(", ")
>> +
>> +
>> +#
>> +# Display kernel upcall statistics
>> +#
>> +def display_upcall_results():
>> +    upcalls = bpf.get_table("upcall_count")
>
> From bcc's reference: get_table()
> Returns a table object. This is no longer used, as tables can now be
> read as items from BPF. Eg: ```BPF[name]```.
>
> Not sure if this means the API will be deprecated...

Changed it, let not wait for the deprecation notice ;)

>> +    have_upcalls = False
>> +
>> +    for k in upcalls.keys():
>
> Interestingly, "keys()" is not documented as an API. Nevertheless, it's
> not necessary, since "table" iterates over keys (same as list or dict in
> python). So it could be written as "for k in upcalls:".

Good catch; changed here and below.

>> +        if not (0 <= k.value <= 255) or upcalls.sum(k).value == 0:
>
> I think the pythonic way is "if k.value not in range(0, 255). Still how
> would it be possible to read keys greater than 255 if it's being limited
> at the ebpf side?

See above.

>> +            continue
>> +        have_upcalls = True
>> +        break
>> +
>> +    if not have_upcalls:
>> +        return
>> +
>> +    print("\n\n# UPCALL STATISTICS:\n  Total upcalls       : {} 
>> [{}]".format(
>> +        *format_per_cpu_array(upcalls)))
>> +
>
> When looking at theese statistics, users may take a while to figure out
> what each number means, e.g:
>
> # UPCALL STATISTICS:
>   Total upcalls       : 240933 [0: 33165, 2: 27495, 3: 18270, 4:
> 15300, 5: 33885, 6: 4050, 8: 51348, 9: 33075, 10: 24345]
>   Successfull upcalls : 80538 [0: 11804, 2: 8382, 3: 5660, 4: 5716, 5:
> 11606, 6: 2845, 8: 13717, 9: 10333, 10: 10475]
>   Failed upcalls      : 161587 [0: 21361, 2: 19113, 3: 12610, 4: 9584,
> 5: 22279, 6: 2397, 8: 37631, 9: 22742, 10: 13870]
>      11, EAGAIN       : 162528 [0: 21361, 2: 19113, 3: 12610, 4: 9584,
> 5: 22279, 6: 3345, 8: 37631, 9: 22742, 10: 1387
>
> An option could be to describe the format in the header, e.g:
>
> # UPCALL STATISTICS (TOTAL [CPU_ID: UPCALL_PER_CPU, ...])
>   Total upcalls       : 240933 [0: 33165, 2: 27495, 3: 18270, 4:
> 15300, 5: 33885, 6: 4050, 8: 51348, 9: 33075, 10: 24345]
>   Successfull upcalls : 80538 [0: 11804, 2: 8382, 3: 5660, 4: 5716, 5:
> 11606, 6: 2845, 8: 13717, 9: 10333, 10: 10475]
>   Failed upcalls      : 161587 [0: 21361, 2: 19113, 3: 12610, 4: 9584,
> 5: 22279, 6: 2397, 8: 37631, 9: 22742, 10: 13870]
>      11, EAGAIN       : 162528 [0: 21361, 2: 19113, 3: 12610, 4: 9584,
> 5: 22279, 6: 3345, 8: 37631, 9: 22742, 10: 1387

I went for the above, as the CPU set might not be the same in all lines, and I 
would like to condense this as much as possible (I was testing on 64 CPUs ;).

> Another option could be to print them in a table:
>
>      Result       TOTAL    CPU 0   CPU 2   CPU 3    CPU 4   CPU 5
> CPU 6   CPU 8   CPU 9    CPU 10
>  -------------- --------- ------- ------- -------- ------- -------
> ------- ------- -------- --------
>   TOTAL           240933   33165   27495    18270   15300   33885
> 4050   51348    33075    24345
>    Successful     80538    11804    8382     5660    5716   11606
> 2845   13717    10333    10475
>    Failed        161587    21361   19113    12610    9584   22279
> 2397   37631    22742    13870
>     11, EAGAIN   162528    21361   19113    12610    9584   22279
> 3345   37631    22742     1387
>
>
>
>
>> +    for k in upcalls.keys():
>
> nit: Same as above, "for k in upcalls:"

Ack

>> +        error = k.value
>> +        total, per_cpu = format_per_cpu_array(upcalls, key=k)
>> +
>> +        if error != 0 and total == "0":
>
> Maybe return the total as integer to avoid this strange looking '== "0"'?

Ack, yes was too much C-minded here always returning a string…

>> +            continue
>> +
>> +        if error == 0:
>> +            print("  Successfull upcalls : {} [{}]".format(total, per_cpu))
>> +
>> +            total, per_cpu = format_per_cpu_array(upcalls, skip_key=k)
>> +            print("  Failed upcalls      : {} [{}]".format(total, per_cpu))
>> +        else:
>> +            print("    {:3}, {:13}: {} [{}]".format(error,
>> +                                                    get_errno_short(error),
>> +                                                    total, per_cpu))
>> +
>> +
>>  #
>>  # process_results()
>>  #
>> @@ -1074,7 +1188,12 @@ def process_results(syscall_events=None, 
>> trigger_delta=None):
>>                  indent, "TOTAL:", "", total_count, total_ns))
>>
>>      #
>> -    # Print events
>> +    # Print upcall statistics
>> +    #
>> +    display_upcall_results()
>> +
>> +    #
>> +    # Print syscall events
>>      #
>>      lost_stack_traces = 0
>>      if syscall_events:
>> @@ -1194,6 +1313,9 @@ def main():
>>      parser.add_argument("--skip-syscall-poll-events",
>>                          help="Skip poll() syscalls with --syscall-events",
>>                          action="store_true")
>> +    parser.add_argument("--skip-upcall-stats",
>> +                        help="Skip the collection of upcall statistics",
>> +                        action="store_true")
>>      parser.add_argument("--stack-trace-size",
>>                          help="Number of unique stack traces that can be "
>>                          "recorded, default 4096. 0 to disable",
>> @@ -1298,6 +1420,9 @@ def main():
>>      source = source.replace("<STACK_TRACE_ENABLED>", "true"
>>                              if options.stack_trace_size > 0 else "false")
>>
>> +    source = source.replace("<INSTALL_OVS_DP_UPCALL_PROBE>", "0"
>> +                            if options.skip_upcall_stats else "1")
>> +
>>      #
>>      # Handle start/stop probes
>>      #
>> diff --git a/utilities/usdt-scripts/kernel_delay.rst 
>> b/utilities/usdt-scripts/kernel_delay.rst
>> index 0f6f916a7..41620d760 100644
>> --- a/utilities/usdt-scripts/kernel_delay.rst
>> +++ b/utilities/usdt-scripts/kernel_delay.rst
>> @@ -109,6 +109,7 @@ followed by resource-specific data. Which are:
>>  - ``THREAD STOPPED STATISTICS``
>>  - ``HARD IRQ STATISTICS``
>>  - ``SOFT IRQ STATISTICS``
>> +- ``UPCALL STATISTICS``
>>
>>  The following sections will describe in detail what statistics they report.
>>
>> @@ -187,6 +188,14 @@ number of interrupts (``COUNT``), the total time spent 
>> in the interrupt
>>  handler (``TOTAL ns``), and the worst-case duration (``MAX ns``).
>>
>>
>> +``UPCALL STATISTICS``
>> +~~~~~~~~~~~~~~~~~~~~~
>> +The ``UPCALL STATISTICS`` section provides information on the number of
>> +upcalls sent by the kernel to userspace. If any upcalls fail to be sent,
>> +the specific return codes are recorded. Statistics are presented both as
>> +a total and on a per-CPU basis.
>> +
>> +
>>  The ``--syscall-events`` option
>>  -------------------------------
>>  In addition to reporting global syscall statistics in 
>> ``SYSCALL_STATISTICS``,
>>
>> _______________________________________________
>> dev mailing list
>> [email protected]
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to