[Public]

Hi Stephen,

Thank you for sharing.

Snipped

>
>
> On Fri,  8 Aug 2025 07:49:09 -0400
> Khadem Ullah <14pwcse1...@uetpeshawar.edu.pk> wrote:
>
> > The crashes are on 22.11, 23.03, 24.11, it is on all dpdk stable versions
> > and 25.07 as well.
> > Please first close primary testpmd before secondary testpmd
> > application and try to close secondary or execute any of the following
> > commands,
> >
> > "show device info all
> > show port stats all
> > show port xstats all
> > set fwd rxonly
> > set fwd txonly
> > start
> > etc"
> >
> > We are all agree that these crashes exists. First we were tried to
> > prevent the crashes at PMD level, but it was not possible to add
> > checks in each PMD. Then we tried to add safety checks in ethdev
> > layer, and it was not suitable as with primary closing all reference
> > to device information (pointers) would lead crashes.
> >
> > Then we agreed on secondary process monitoring for primary process exiting.
> > and it is now resolved on application level, i.e. on testpmd.
> >
> > Now, this solution is working perfectly. We can add eal_cleanup for
> > gracefull exit.
> >
> > Best Regards,
> > Khadem
>
> Maybe this quick picture would help explain the data structures
>
>                                │
>                                │            Huge pages (shared)
>                                │
>             rte_eth_devices[]  │
>                                │
>              ┌────────┐        │
> Primary      │        ┼────────┼───┐
> Process      │        │        │   │
>              ┌────────┐        │   │
>              │        ┼────┐   │   │
>              │        │    │   │   │              rte_eth_dev_data
>              └────────┘    │   │   │            ┌─────────────────┐
>                            │   │   │            │                 │
>                            │   │   └───────────►│                 │
>                            │   │                │              ───┼─────────────►
>                            │   │        ┌───────►                 │
>                            │   │        │       │                 │
>                            │   │        │       └─────────────────┘
>                            │   │        │
>                            │   │        │       ┌─────────────────┐
>                            │   │        │       │                 │
>                            └───┼────────┼───────►                 │
>                                │        │       │            ─────┼────────────►
>             rte_eth_devices    │        │  ┌────►                 │
>   Secondary ┌────────┐         │        │  │    │                 │
>   Process   │        ┼─────────┼────────┘  │    └─────────────────┘
>             │        │         │           │
>             ┌────────┐         │           │
>             │        ┼─────────┼───────────┘
>             │        │         │
>             └────────┘         │
>                                │

Something in the way `rte_eth_dev_data` is handled has definitely changed in
some release. Earlier, when a secondary came up, it probed the memory shared
for rte_eth_dev_data to pull the physical devices into its local memory. Then
all virtual devices under the secondary were added.

Hence, adding a physical or virtual device in the primary was not reflected in
the secondary once the secondary had completed rte_eal_init.
I am still trying to figure out why @Khadem Ullah mentioned:

```
Please first close primary testpmd before secondary testpmd application and try 
to close secondary or execute any of the following commands,

"show device info all
show port stats all
show port xstats all
set fwd rxonly
set fwd txonly
start
etc"
```

I think the reason is that in all our testing we use SIGKILL to kill the
primary, not a close or shutdown, which would trigger the cleanup.
The rule of thumb: if you are shutting down the primary, always shut down the
secondaries first.
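
To make the ordering concern concrete, here is a minimal sketch in plain C. It
is not DPDK code: `shared_dev_data`, `local_dev` and `guarded_rx_packets` are
hypothetical stand-ins for the shared rte_eth_dev_data block, a per-process
rte_eth_devices[] entry, and a guarded stats read. The `primary_alive` flag is
also an assumption; a real secondary might derive it from something like
`rte_eal_primary_proc_alive()`, and that wiring is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared rte_eth_dev_data block in huge pages. */
struct shared_dev_data {
    const char *name;
    unsigned long rx_packets;
};

/* Hypothetical stand-in for one per-process rte_eth_devices[] entry. */
struct local_dev {
    struct shared_dev_data *data; /* points into memory set up by the primary */
};

/*
 * Read a counter through the local entry only while the primary is alive.
 * Returns 0 and stores the value in *out, or -1 if the read was refused.
 */
static int guarded_rx_packets(const struct local_dev *dev, bool primary_alive,
                              unsigned long *out)
{
    assert(dev != NULL && out != NULL);
    if (!primary_alive || dev->data == NULL)
        return -1; /* shared state may be gone: do not dereference */
    *out = dev->data->rx_packets;
    return 0;
}
```

A caller that gets -1 back would skip the command (or exit) instead of
touching the shared data.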

@Khadem Ullah I am happy to be available on Slack so that you can recreate the
issue (as you have a working setup) with the version deployed.
Please do let me know.

Regards
Vipin Varghese
