----- Original Message -----
> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
> Sent: Tuesday, March 18, 2014 12:30:01 AM
> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>
> 2014-03-18 8:03 GMT+09:00 David Vossel <dvos...@redhat.com>:
> >
> > ----- Original Message -----
> >> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
> >> To: "The Pacemaker cluster resource manager"
> >> <pacemaker@oss.clusterlabs.org>
> >> Sent: Monday, March 17, 2014 4:51:11 AM
> >> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> >>
> >> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.ino...@gmail.com>:
> >> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvos...@redhat.com>:
> >> >>
> >> >>
> >> >> ----- Original Message -----
> >> >>> From: "Kazunori INOUE" <kazunori.ino...@gmail.com>
> >> >>> To: "pm" <pacemaker@oss.clusterlabs.org>
> >> >>> Sent: Friday, March 14, 2014 5:52:38 AM
> >> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> When specifying the node name in UPPER case and performing
> >> >>> crm_resource, crmd was aborted.
> >> >>> (The real node name is in LOWER case.)
> >> >>
> >> >> https://github.com/ClusterLabs/pacemaker/pull/462
> >> >>
> >> >> does that fix it?
> >> >>
> >> >
> >> > Since behavior of glib is strange somehow, the result is NO.
> >> > I tested this branch.
> >> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
> >> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
> >> > * glib2-2.22.5-7.el6.x86_64
> >> >
> >> > strcase_equal() is not called from g_hash_table_lookup().
> >> >
> >> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
> >> > ...snip...
> >> > (gdb) b lrm.c:1232
> >> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
> >> > (gdb) b strcase_equal
> >> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
> >> > (gdb) c
> >> > Continuing.
> >> >
> >> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
> >> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
> >> > msg_data=0x7fff8d679540) at lrm.c:1232
> >> > 1232 lrm_state = lrm_state_find(target_node);
> >> > (gdb) s
> >> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
> >> > 267 {
> >> > (gdb) n
> >> > 268 if (!node_name) {
> >> > (gdb) n
> >> > 271 return g_hash_table_lookup(lrm_state_table, node_name);
> >> > (gdb) p g_hash_table_size(lrm_state_table)
> >> > $1 = 1
> >> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
> >> > $2 = 0x1c791a0 "x3650h"
> >> > (gdb) p node_name
> >> > $3 = 0x1d4c650 "X3650H"
> >> > (gdb) n
> >> > 272 }
> >> > (gdb) n
> >> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
> >> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
> >> > at lrm.c:1234
> >> > 1234 if (lrm_state == NULL && is_remote_node) {
> >> > (gdb) n
> >> > 1240 CRM_ASSERT(lrm_state != NULL);
> >> > (gdb) n
> >> >
> >> > Program received signal SIGABRT, Aborted.
> >> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
> >> > (gdb)
> >> >
> >> >
> >> > I wonder why... so I will continue investigation.
> >> >
> >> >
> >>
> >> I read the code of g_hash_table_lookup().
> >> The key is first matched by the hash value generated by crm_str_hash,
> >> before strcase_equal() is called.
> >
> > good catch. I've updated the patch in this pull request. Can you give it a
> > go?
> >
> > https://github.com/ClusterLabs/pacemaker/pull/462
> >
> With only this change, fail-count is not cleared.
>
> $ crm_resource -C -r p1 -N X3650H
> Cleaning up p1 on X3650H
> Waiting for 1 replies from the CRMd. OK
>
> $ grep fail-count /var/log/ha-log
> Mar 18 13:53:36 x3650g attrd[3610]: debug: attrd_client_message:
> Broadcasting fail-count-p1[X3650H] = (null)
> $
>
> $ crm_mon -rf1
> Last updated: Tue Mar 18 13:54:51 2014
> Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
> Stack: corosync
> Current DC: x3650h (3232261384) - partition with quorum
> Version: 1.1.10-83553fa
> 2 Nodes configured
> 1 Resources configured
>
>
> Online: [ x3650g x3650h ]
>
> Full list of resources:
>
> p1 (ocf::pacemaker:Dummy): Stopped
>
> Migration summary:
> * Node x3650h:
> p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18
> 13:53:19 2014'
> * Node x3650g:
> $
>
>
> So this change also seems to be necessary.

yep, added your patch to the pull request
https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d

I found another one in stonith that I fixed.

https://github.com/ClusterLabs/pacemaker/pull/462

Are we good for merging this now?

-- Vossel
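
For anyone following the thread, the lookup problem found above (the hash is compared before the equality callback is ever consulted) can be illustrated with a small glib-free sketch. All names here (cs_hash, ci_hash, ci_equal, toy_lookup) are hypothetical and are not the functions in the pull request; cs_hash merely stands in for a case-sensitive string hash such as crm_str_hash, and toy_lookup only imitates the hash-then-equal order described for g_hash_table_lookup().

```c
#include <ctype.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for a case-sensitive string hash (djb2 style). */
static unsigned int cs_hash(const char *s)
{
    unsigned int h = 5381;

    for (; *s != '\0'; s++) {
        h = h * 33u + (unsigned char) *s;
    }
    return h;
}

/* Hypothetical case-insensitive variant: fold each byte to lower case
 * before mixing it in, so "X3650H" and "x3650h" hash identically. */
static unsigned int ci_hash(const char *s)
{
    unsigned int h = 5381;

    for (; *s != '\0'; s++) {
        h = h * 33u + (unsigned int) tolower((unsigned char) *s);
    }
    return h;
}

/* Case-insensitive equality, in the spirit of strcase_equal(). */
static int ci_equal(const char *a, const char *b)
{
    return strcasecmp(a, b) == 0;
}

typedef unsigned int (*hash_fn) (const char *);

/* Toy single-entry lookup: the hashes are compared first, and the
 * equality callback is consulted only when they match.
 * Returns 1 when probe finds stored, 0 otherwise. */
static int toy_lookup(const char *stored, const char *probe, hash_fn hash)
{
    if (hash(stored) != hash(probe)) {
        return 0;   /* ci_equal() is never reached */
    }
    return ci_equal(stored, probe);
}
```

With the case-sensitive hash, toy_lookup("x3650h", "X3650H", cs_hash) misses even though ci_equal() would accept the pair; with ci_hash it hits. In GLib the hash and equality callbacks are both given to g_hash_table_new(), so the two have to agree on what counts as the same key.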