Hi,

I debugged further: I built a version with debugging symbols and put an assert into the code where we detect the out-of-sequence netlink message.

My understanding is that we run krt_if_scan on a regular basis. While the scan is still being processed, we go through the notification chain and end up sending another netlink message before krt_if_scan has pulled all netlink messages belonging to the scan. nl_get_reply then receives an out-of-sequence netlink message and drops it, although nl_get_scan would need it as the end-of-messages marker. So when we return from the notification chain, nl_get_scan ends up polling for the next message, which has already been removed via nl_send_route -> nl_exchange -> nl_get_reply as out-of-sequence. As the netlink FD is blocking, we deadlock here.
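
The sequence of events described above can be sketched as a small toy model. This is not BIRD's code: struct sock, sock_push, sock_pop, exchange, scan and run_scan are hypothetical stand-ins, the in-memory queue stands in for the netlink socket, and a `false` return marks the point where a blocking read would hang forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not BIRD's real types. */
enum msg_type { MSG_LINK, MSG_ACK, MSG_DONE };

struct msg {
    unsigned seq;          /* sequence number of the request it answers */
    enum msg_type type;
};

#define QUEUE_MAX 16

struct sock {              /* models the kernel's receive queue */
    struct msg q[QUEUE_MAX];
    size_t head, tail;
};

static void sock_push(struct sock *s, unsigned seq, enum msg_type type)
{
    assert(s->tail < QUEUE_MAX);
    s->q[s->tail].seq = seq;
    s->q[s->tail].type = type;
    s->tail++;
}

/* Returns false where a real blocking recv() would never return. */
static bool sock_pop(struct sock *s, struct msg *out)
{
    if (s->head == s->tail)
        return false;
    *out = s->q[s->head++];
    return true;
}

/* Models nl_exchange -> nl_get_reply: wait for the ACK of `seq` and
 * silently drop every message with a different sequence number. */
static bool exchange(struct sock *s, unsigned seq)
{
    struct msg m;

    sock_push(s, seq, MSG_ACK);        /* the kernel answers our request */
    while (sock_pop(s, &m)) {
        if (m.seq == seq && m.type == MSG_ACK)
            return true;
        /* out-of-sequence: dropped, even if it was the scan's MSG_DONE */
    }
    return false;
}

/* Models the scan loop: true if MSG_DONE was seen, false if the loop
 * would block forever waiting for it. */
static bool scan(struct sock *s, unsigned scan_seq, bool nested_send)
{
    struct msg m;
    unsigned next_seq = scan_seq + 1;

    while (sock_pop(s, &m)) {
        if (m.seq != scan_seq)
            continue;
        if (m.type == MSG_DONE)
            return true;
        if (nested_send)               /* notification chain sends a route */
            exchange(s, next_seq++);
    }
    return false;                      /* queue drained without MSG_DONE */
}

/* Runs one scan over two link messages followed by the end marker. */
static bool run_scan(bool nested_send)
{
    struct sock s = {0};

    sock_push(&s, 1, MSG_LINK);
    sock_push(&s, 1, MSG_LINK);
    sock_push(&s, 1, MSG_DONE);
    return scan(&s, 1, nested_send);
}
```

In this model, run_scan(false) sees the end marker and finishes, while run_scan(true) loses it inside the nested exchange and never sees it, which corresponds to the blocking read that hangs.
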
One solution I see would be to convert all nl_get_scan users to first poll ALL messages before starting to process them. That could mean a lot of memory usage, especially with full BGP routing, where we would need to pull in all routes first. Another solution would be some generic polling/callback-based approach, or possibly replacing netlink.c with an API wrapper around libnl (http://people.suug.ch/~tgr/libnl/).

(gdb) bt
#0  0xb7e3e947 in raise () from /lib/tls/libc.so.6
#1  0xb7e400c9 in abort () from /lib/tls/libc.so.6
#2  0xb7e3805f in __assert_fail () from /lib/tls/libc.so.6
#3  0x08078a09 in nl_get_reply () at netlink.c:135
#4  0x08078b0d in nl_exchange (pkt=0xbfe3f2b4) at netlink.c:193
#5  0x08079562 in nl_send_route (p=0x80905a0, e=0x809921c, new=0) at netlink.c:542
#6  0x080795d2 in krt_set_notify (p=0x80905a0, n=0x80981c4, new=0x0, old=0x809921c) at netlink.c:561
#7  0x0807614e in krt_notify (P=0x80905a0, net=0x80981c4, new=0x0, old=0x809921c, attrs=0x0) at krt.c:698
#8  0x08049910 in do_rte_announce (a=0x8095e40, net=0x80981c4, new=0x80991cc, old=0x809921c, tmpa=0x0, class=4100) at /home/flo/p/root/bird-1.0.11/./nest/rt-table.c:227
#9  0x08049521 in rte_announce (tab=0x808f4b0, net=0x80981c4, new=0x80991cc, old=0x809921c, tmpa=0x0) at /home/flo/p/root/bird-1.0.11/./nest/rt-table.c:261
#10 0x08049b94 in rte_recalculate (table=0x808f4b0, net=0x80981c4, p=0x80906c8, new=0x80991cc, tmpa=0x0) at /home/flo/p/root/bird-1.0.11/./nest/rt-table.c:368
#11 0x08049ff1 in rte_update (table=0x808f4b0, net=0x80981c4, p=0x80906c8, new=0x80991cc) at /home/flo/p/root/bird-1.0.11/./nest/rt-table.c:514
#12 0x0804fd41 in dev_ifa_notify (p=0x80906c8, c=1, ad=0x8095c80) at /home/flo/p/root/bird-1.0.11/./nest/rt-dev.c:69
#13 0x0804e962 in ifa_send_notify (p=0x80906c8, c=1, a=0x8095c80) at /home/flo/p/root/bird-1.0.11/./nest/iface.c:148
#14 0x0804e87e in ifa_notify_change (c=1, a=0x8095c80) at /home/flo/p/root/bird-1.0.11/./nest/iface.c:159
#15 0x0804ea7d in if_notify_change (c=1, i=0x8095b38) at /home/flo/p/root/bird-1.0.11/./nest/iface.c:211
#16 0x0804ed24 in if_update (new=0xbfe3f5ec) at /home/flo/p/root/bird-1.0.11/./nest/iface.c:280
#17 0x08078f5c in nl_parse_link (h=0x80919b8, scan=1) at netlink.c:334
#18 0x080792a1 in krt_if_scan (p=0x8090778) at netlink.c:445
#19 0x08074fd5 in kif_scan (t=0x8095af8) at krt.c:94
#20 0x08075004 in kif_force_scan () at krt.c:102
#21 0x08076018 in krt_scan (t=0x8095a48) at krt.c:655
#22 0x08073005 in tm_shot () at io.c:298
#23 0x08074886 in io_loop () at io.c:1126
#24 0x080774ef in main (argc=Cannot access memory at address 0x67ec
) at main.c:462

-- 
Florian Lohoff                  [EMAIL PROTECTED]             +49-171-2280134
Those who would give up a little freedom to get a little
security shall soon have neither - Benjamin Franklin

