> Paul,
>
> Thanks for the advice; we ran memtest as well as the Dell complete system
> diagnostic and neither found an issue. The plot thickens, though!
>
> Our admins messed up our kickstart labels and what I *thought* was CentOS
> 6.4 was actually RHEL 6.4, and the problem seems to be following the CentOS
> 6.4 installations -- the current configuration of success/failure is:
> 1 server  - Westmere     - RHEL 6.4   -- works
> 1 server  - Sandy Bridge - RHEL 6.4   -- works
> 2 servers - Sandy Bridge - CentOS 6.4 -- fails
>
> Given that the hardware seems otherwise stable/checks out, I'm trying to
> figure out how to determine whether this is:
> a) a bug in our software
> b) a kernel/hugetlbfs bug
> c) a DPDK 1.6.0r2 bug
>
> I have seen cases where calling rte_eal_init too late in a process causes
> similar issues (e.g. calling 'free' on memory that was allocated with
> 'malloc' before rte_eal_init is called results in a segfault in libc),
> which seems odd to me, but in this case we are calling rte_eal_init as the
> first thing we do in main().
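>
> For reference, the top of our main() looks roughly like this (trimmed down
> to the EAL call and its error handling; all of our real setup is omitted):
>
> #include <stdlib.h>
> #include <rte_eal.h>
> #include <rte_debug.h>
>
> int main(int argc, char **argv)
> {
>     /* EAL init is the very first thing we do -- before any malloc()
>      * of memory we later free and before any mempool creation. */
>     int ret = rte_eal_init(argc, argv);
>     if (ret < 0)
>         rte_exit(EXIT_FAILURE, "rte_eal_init failed\n");
>
>     /* Skip the EAL arguments that rte_eal_init() consumed. */
>     argc -= ret;
>     argv += ret;
>
>     /* ... mempool creation, port setup, lcore launch ... */
>     return 0;
> }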
I have seen the following issues cause mbuf corruption of this type:

1. Calling rte_pktmbuf_free() on an mbuf and then still using a reference to
   that mbuf.

2. Calling rte_pktmbuf_free() and rte_pktmbuf_alloc() from a plain pthread
   (i.e. not a "dpdk" thread). This corrupted the per-lcore mbuf cache (a
   rough sketch of this pattern is below). Not pleasant to debug, especially
   if you are sharing the mempool between primary and secondary processes.

I have no tips for debugging other than a careful code review of everywhere
an mbuf is freed or allocated.

Mark
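To make case 2 concrete, here is a rough sketch of the kind of code that
triggers it -- purely illustrative (the pool name, thread function, and setup
are made up), not code from any real application:

/* Assumes "mbuf_pool" was created elsewhere with a non-zero per-lcore
 * cache, which is the usual setup for a packet mbuf pool. */
#include <pthread.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

extern struct rte_mempool *mbuf_pool;

/* Runs in a plain pthread, NOT an EAL lcore. A non-EAL pthread does not
 * get its own lcore id, so alloc/free here can end up sharing a per-lcore
 * mempool cache with a real lcore, and that cache is not safe against
 * concurrent use -> corruption. */
static void *housekeeping_thread(void *arg)
{
    for (;;) {
        struct rte_mbuf *m = rte_pktmbuf_alloc(mbuf_pool);
        if (m != NULL) {
            /* ... touch the buffer, then drop it ... */
            rte_pktmbuf_free(m);
        }
    }
    return NULL;
}

/* Somewhere after rte_eal_init():
 *     pthread_create(&tid, NULL, housekeeping_thread, NULL);
 */

One way to avoid it is to keep all mbuf alloc/free on EAL lcores and hand
buffers to the pthread through an rte_ring; creating the shared pool with a
cache_size of 0 should also sidestep the per-lcore cache, at some cost in
mempool performance.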