Hi,

sorry for the delay. I have now tested with tip.tar.bz2.

> please reproduce with most recent heartbeat,
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/
> (http:// ... /archive/tip.tar.bz2)
>
> There have been changes in the cleanup path for dead nodes.
>
>
> You did clone one into the other,
> or did a clean, independent install of both?

No clones. Both machines were set up with the installer.

> Pacemaker version?
> Glue version?

Pacemaker is 1.0.9.1+hg15626-1 (from Debian 6.0.3)
Glue is 1.0.6 (from Debian 6.0.3)
running on Kernel 2.6.32

> How, exactly? Is this "symmetrical" or "asymmetrical", i.e. does that
> block only incoming/only outgoing/both? Have both nodes seen the
> respective other as dead?
> Does that block both links at the same time, or one after the other?

I have now tested both, symmetrical and asymmetrical, delay 2 sec.
Once, the error occurred after disabling/enabling in the asymmetrical case,
but it does not occur every time. Both nodes show the other as offline.

> Membership only sees _one_ node here, still.
> nodes=1; this somehow looks like stale data.
>
> What does /usr/lib/heartbeat/ccm_test_client say
> during such an experiment? both nodes?

It shows:

./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NODES IN THE PRIMARY MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: nodeid=0, uname=debian60-clnode1, born=514
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MY NODE IS A MEMBER OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: no mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: instance=515, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: instance=515 # ttl members=1, ttl_idx=0 # new members=0, new_idx=1 # out members=0, out_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NODES IN THE PRIMARY MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: nodeid=0, uname=debian60-clnode1, born=515
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MY NODE IS A MEMBER OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: no mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: instance=516, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: instance=516 # ttl members=1, ttl_idx=0 # new members=0, new_idx=1 # out members=0, out_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NODES IN THE PRIMARY MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: nodeid=0, uname=debian60-clnode1, born=516
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MY NODE IS A MEMBER OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: no mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: instance=517, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: instance=517 # ttl members=1, ttl_idx=0 # new members=0, new_idx=1 # out members=0, out_idx=3
.
.
.

> (or lib64, may be in the -dev package)
> btw, as long as it can talk to ccm,
> that does not terminate by itself,
> you need to ctrl-c it...
>
> I don't think so.
> Anything about lost packets?
>
> If you can "easily" reproduce (with most recent heartbeat),
> a tcpdump may be useful from, say, 10 heartbeats before you disable,
> to half a minute after you re-enable the network link.
> (or just crank up debugging so high that you even see message dumps in the
> logs...)
>
>

I still have to sniff the traffic.

In this case, when the nodes showed each other offline, I disconnected and
reconnected the interfaces. After a few seconds they showed each other online.

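To make the first ccm_testclient dump above easier to compare between runs, here is a minimal sketch that pulls out only the "mem_handle_event: instance=..." lines and lists the events where the instance number advanced although nobody joined or left. It assumes the log format is exactly as pasted above; the names membership_events and churn_without_change are made up for this illustration and are not part of heartbeat.

```python
import re

# Matches the "mem_handle_event: instance=..., nodes=..." lines as they
# appear in the ccm_testclient output pasted above (format is an assumption).
EVENT_RE = re.compile(
    r"(?P<time>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) info: mem_handle_event: "
    r"instance=(?P<instance>\d+), nodes=(?P<nodes>\d+), "
    r"new=(?P<new>\d+), lost=(?P<lost>\d+)"
)


def membership_events(text):
    """Return one dict per membership event found in ccm_testclient output."""
    events = []
    for m in EVENT_RE.finditer(text):
        events.append({
            "time": m.group("time"),
            "instance": int(m.group("instance")),
            "nodes": int(m.group("nodes")),
            "new": int(m.group("new")),
            "lost": int(m.group("lost")),
        })
    return events


def churn_without_change(events):
    """Events that report a new membership with no joined and no lost nodes."""
    return [e for e in events if e["new"] == 0 and e["lost"] == 0]


# Three lines copied from the dump above.
SAMPLE = """\
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: instance=515, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: instance=516, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: instance=517, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
"""

if __name__ == "__main__":
    for e in churn_without_change(membership_events(SAMPLE)):
        print(e["time"], "instance", e["instance"], "nodes", e["nodes"])
```

Fed with the full output saved from both nodes, this should show at a glance whether one side keeps announcing a new single-node membership every second, as in the dump above.
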
After the reconnect, ccm_testclient shows:

./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event: instance=574, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=4
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: event=NEW MEMBERSHIP:
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: instance=574 # ttl members=2, ttl_idx=0 # new members=2, new_idx=0 # out members=0, out_idx=4
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NODES IN THE PRIMARY MEMBERSHIP
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1, uname=debian60-clnode2, born=1
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0, uname=debian60-clnode1, born=574
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MY NODE IS A MEMBER OF THE MEMBERSHIP LIST
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NEW MEMBERS
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1, uname=debian60-clnode2, born=1
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0, uname=debian60-clnode1, born=574
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MEMBERS LOST
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NONE
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: -----------------------

I also checked the clocks on both nodes. They are in sync.

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
