On Fri, Jun 30, 2023 at 12:33 PM Ilya Maximets <[email protected]> wrote:
>
> On 6/30/23 18:13, Ilya Maximets wrote:
> > On 6/30/23 00:45, Mike Pattrick wrote:
> >> On Thu, Jun 29, 2023 at 3:10 PM Ilya Maximets <[email protected]> wrote:
> >>>
> >>> On 6/9/23 17:05, Mike Pattrick wrote:
> >>>> Several xlate actions used in recursive translation currently store a
> >>>> large amount of information on the stack. This can result in handler
> >>>> threads quickly running out of stack space despite before
> >>>> xlate_resubmit_resource_check() is able to terminate translation. This
> >>>> patch reduces stack usage by over 3kb from several translation actions.
> >>>
> >>> I'm assuming that this number is mostly about patch_port_output function,
> >>> right?  I see only ~30% stack usage reduction for the main recursive
> >>> do_xlate_actions function.  It went from 720 to 512 bytes.
> >>>
> >>> Or am I calculating it wrong?  I'm using -fstack-size compiler option.
> >>
> >> I was using GDB for this, finding the difference in the stack pointer
> >> between calls to the same function during recursion. The intention of
> >> this patch is to get to xlate_resubmit_resource_check() without
> >> smashing the stack, no individual functions stack usage is as
> >> important as the holistic stack usage while recursing.
> >>
> >> For example, without patch:
> >>
> >> (gdb) fram 1
> >> #1  0x000000000044dee3 in xlate_table_action (...) at
> >> ofproto/ofproto-dpif-xlate.c:4539
> >> 4539        if (xlate_resubmit_resource_check(ctx)) {
> >> (gdb) p $rsp
> >> $4 = (void *) 0x7f888e7aa700
> >> (gdb) frame 8
> >> #8  xlate_table_action (...) at ofproto/ofproto-dpif-xlate.c:4584
> >> 4584                xlate_recursively(ctx, rule, table_id <= old_table_id,
> >> (gdb) p $rsp
> >> $5 = (void *) 0x7f888e7aba70
> >> (gdb) p 0x7f888e7aba70-0x7f888e7aa700
> >> $6 = 4976
> >>
> >> With patch:
> >>
> >> #1  0x000000000044e3bc in xlate_table_action (...) at
> >> ofproto/ofproto-dpif-xlate.c:4552
> >> 4552        if (xlate_resubmit_resource_check(ctx)) {
> >> (gdb) p $rsp
> >> $4 = (void *) 0x7f78e69a9860
> >> (gdb) frame 9
> >> #9  0x000000000044ec00 in xlate_table_action (...) at
> >> ofproto/ofproto-dpif-xlate.c:7212
> >> 7212                xlate_table_action(ctx, 
> >> ctx->xin->flow.in_port.ofp_port,
> >> (gdb) p $rsp
> >> $5 = (void *) 0x7f78e69a9c60
> >> (gdb) p 0x7f78e69a9c60-0x7f78e69a9860
> >> $6 = 1024
> >>
> >>
> >>>
> >>>> This patch also moves some trace function from do_xlate_actions into its
> >>>> own function to reduce stack usage.
> >>>
> >>> This doesn't seem to work.  With GCC and -O2, i.e. the standard 
> >>> optimization
> >>> level, this new function is always inlined for me, and there is no real
> >>> difference in the produced code and the stack usage.
> >>
> >> When I compile with O2, I get the following code generation:
> >>
> >> Without patch:
> >> 000000000044cda0 <do_xlate_actions>:
> >> [...]
> >>   44cdad:       48 81 ec 78 02 00 00    sub    rsp,0x278
> >>
> >> With patch:
> >> 000000000044cff0 <do_xlate_actions>:
> >> [...]
> >>   44cffd:       48 81 ec 98 01 00 00    sub    rsp,0x198
> >>
> >> Do you get different results?
> >
> > If we're just looking at the first sub in the function, I get the same
> > number (0x278) on current master and 0x148 with your patch applied.
> >
> > Moving the port name map to heap reduces the value to 0x128.
> > The xlate_trace function is fully inlined according to objdump.>
> >>
> >>>
> >>> In order to achieve actual reduction of stack usage there we need to move
> >>> to dynamic allocation of the port name map.q
> >>>
> >>> There are few more functions that are fully or partially inlined into
> >>> do_xlate_actions with default compiler options:
> >>>
> >>>  - xlate_output_reg_action: Allocates a large mf_subvalue on stack.
> >>>  - xlate_sample_action: Allocates huge struct flow_tnl(!) on stack.
> >>>  - compose_conntrack_action: Allocates mf_subvalue.
> >>>  - xlate_check_pkt_larger: Allocates mf_subvalue as well.
> >>>
> >>> With these moved to heap, it should be possible to reduce the stack usage
> >>> by about 60% for do_xlate_actions function.
> >>
> >> These don't inline for me on O2. What GCC version are you using?
> >
> > If I do changes in the functions above, the number is dropped to 0xd8.
> >
> > I'm building with gcc (GCC) 13.1.1 20230511 (Red Hat 13.1.1-2) on F38.
> > CFLAGS="-g -O2".
>
> Did a round of tests with clang version 16.0.4 (Fedora 16.0.4-1.fc38)
> and the same CFLAGS.  It's not so aggressive with inlining, so it doesn't
> inline the xlate_trace.  However, that doesn't affect the initial stack
> allocation.
>
> Master    : 0x258
> With patch: 0x258  - No difference with master

That isn't fantastic! I've confirmed the same behaviour with clang.

I'll send in a new version.

-M

>
> With extra changes in functions listed above: 0x128
>
> >
> > Best regards, Ilya Maximets.
>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to