On Thu, Mar 08, 2018 at 02:48:01PM +0800, aogooc xu wrote:
> More debugging information ...
>
> (gdb) f 2
> #2 process_runnable_tasks () at src/task.c:229
> 229 rq_next = eb32_next(rq_next);
> (gdb) print rq_next
> $1 = (struct eb32_node *) 0x2a94840
> (gdb) print rq_next->node
> $2 = {branches = {b = {0x5d903c0, 0x2a94840}}, node_p = 0x0, leaf_p = 0x0,
> bit = 3, pfx = 681}
So the memory is corrupted, as the next node in the tree is not in the tree!
That's something that's normally not structurally possible in ebtrees, so
maybe some memory has been overwritten somewhere else (overflow somewhere,
I don't know). What surprises me is that 1.6 is now quite old and such an
issue has never ever been reported. Since you're saying it's starting to
happen more and more often, I suspect that it could also be a rare case
of a hardware issue (e.g. a defective RAM stick). If you can switch the traffic
to the backup node, you'll see if the problem happens again, and in the
meantime that will leave your master node available to run a memtest.
I'm not saying it's necessarily this, I'm just guessing, and such a test
could be relatively easy.
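To make "not in the tree" more concrete, here is a minimal sketch of the check I'm doing by eye on the dump above. It assumes that a node whose leaf_p is NULL is considered unlinked, which as far as I know is how ebtree marks a removed node. The struct is a simplified stand-in that only reuses the field names printed by gdb, not the real ebtree definition, and the sketch_* names are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the node header printed by gdb; NOT the real
 * ebtree struct, only the same field names (assumption for illustration). */
struct sketch_node {
    void *branches[2];   /* left/right branch pointers */
    void *node_p;        /* parent link when the node is used as an internal node */
    void *leaf_p;        /* parent link when the node is used as a leaf */
    short bit;
    unsigned short pfx;
};

/* Hypothetical helper: a node with no leaf parent is treated as unlinked. */
static int sketch_node_is_linked(const struct sketch_node *n)
{
    return n->leaf_p != NULL;
}

/* Hypothetical helper: rebuild the node from the values in the gdb dump
 * and report whether it looks linked (1) or not (0). */
static int sketch_dump_is_linked(void)
{
    struct sketch_node n = {
        .branches = { (void *)(uintptr_t)0x5d903c0,
                      (void *)(uintptr_t)0x2a94840 },
        .node_p = NULL,   /* node_p = 0x0 in the dump */
        .leaf_p = NULL,   /* leaf_p = 0x0 in the dump */
        .bit = 3,
        .pfx = 681,
    };
    return sketch_node_is_linked(&n);
}
```

With the values from the dump, sketch_dump_is_linked() returns 0: the node that eb32_next() is about to walk from has no parent at all, which is why the run queue walk cannot be trusted at this point.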
Thanks,
Willy