Dietmar,
I think the attached patch will solve your problem. See the comment in the
patch to understand what is happening and why.
I've created BZ https://bugzilla.redhat.com/show_bug.cgi?id=568356
If you feel that your problem is gone, please add a comment to the BZ.
About error 14: this is not really a bug. It can happen when you try to
make another join from the same application and, with the patch applied,
also in the situation leading to the described problem.
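For illustration only, here is a minimal sketch of how a client could treat error 14 as a transient condition and simply try the join again, which is roughly what the cpgtest.c change at the end of this mail does. Everything in it is a stand-in: fake_join, join_with_retry and the SKETCH_* constants are hypothetical names, not libcpg API, and the values 1 and 14 are assumed to match CS_OK and CS_ERR_EXIST in corosync's cs_error_t.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for two cs_error_t values (assumed: CS_OK = 1, CS_ERR_EXIST = 14). */
enum { SKETCH_OK = 1, SKETCH_ERR_EXIST = 14 };

/* Hypothetical state: how many more joins fail while the old leave is undelivered. */
static int pending_leave_rounds = 0;

/* Hypothetical join stub standing in for the real join call. */
static int fake_join(void)
{
    if (pending_leave_rounds > 0) {
        pending_leave_rounds--;
        return SKETCH_ERR_EXIST;
    }
    return SKETCH_OK;
}

/* Retry the join while it reports EXIST, up to max_tries attempts.
 * Returns the last result; *tries_out receives the number of attempts made. */
static int join_with_retry(int (*join_fn)(void), int max_tries, int *tries_out)
{
    int result = SKETCH_ERR_EXIST;
    int tries = 0;

    assert(join_fn != NULL && max_tries > 0);
    while (tries < max_tries) {
        result = join_fn();
        tries++;
        if (result != SKETCH_ERR_EXIST)
            break;
        /* a real client would probably wait briefly here before trying again */
    }
    if (tries_out != NULL)
        *tries_out = tries;
    return result;
}
```

Calling join_with_retry(fake_join, 5, &tries) with pending_leave_rounds set to 2 succeeds on the third attempt. The bound on attempts is a design choice of the sketch, so that a join which keeps failing is reported instead of looping forever.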
Regards,
Honza
Dietmar Maurer wrote:
>> The best application for such a test is testcpg.c. If there really is a bug,
>> can you please create a BZ (ideally with a way to reproduce it, because I'm
>> really not able to reproduce such behavior).
>
> I'm still waiting for a BZ account, so I'm posting it here. The attached
> program 'cpgtest' reproduces the problem. Compile with:
>
> # gcc -Wall cpgtest.c $(pkg-config --cflags --libs libcpg libcoroipcc) -o cpgtest
>
> It executes a simple loop:
>
> start:
> cpg_initialize
> cpg_join
> cpg_dispatch
> send one message in confchg_callback
> cpg_finalize after receiving that message
> goto start
>
> When I run it, it executes several successful iterations, but sometimes
> the join fails:
>
> # cpgtest
> ...
> starting cpgtest
> calling cpg_initialize
> calling cpg_join
> cpg_join failed: 14
>
> And worse, sometimes it hangs in the main loop:
>
> # cpgtest
> ...
> starting cpgtest
> calling cpg_initialize
> calling cpg_join
> starting main loop (hangs here)
>
> When that happens, I abort with CTRL-C. After that, a stale CPG
> member is left behind. After several runs I get:
>
> # corosync-cpgtool
> TESTGROUP\x00
> 4610 3 (192.168.2.8)
> 27678 3 (192.168.2.8)
> 21828 3 (192.168.2.8)
> 16841 3 (192.168.2.8)
> 10901 3 (192.168.2.8)
> 10773 3 (192.168.2.8)
> 10496 3 (192.168.2.8)
> 9866 3 (192.168.2.8)
> 8552 3 (192.168.2.8)
> 7439 3 (192.168.2.8)
> 6782 3 (192.168.2.8)
>
commit e1ee17d4f2453ed81e2c3183d1ea516d7483676f
Author: Jan Friesse <[email protected]>
Date: Thu Feb 25 15:38:25 2010 +0100
Cpg join with undelivered leave message
Patch handles the situation where, on one node, one process:
- joins cpg
- does some actions
- leaves cpg
- joins cpg again
This sequence is racy and can end with a broken process_info list.
To solve this problem, one more check is done in
message_handler_req_lib_cpg_join, so if a process_info with the same pid and
group as the new join request exists, CPG_ERR_EXIST is returned.
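The check described above can be sketched in isolation as follows. This is a simplified model, not the daemon code: struct pinfo and join_would_conflict are hypothetical names, a plain singly linked list replaces the daemon's list macros, and strcmp replaces mar_name_compare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the daemon's process_info entry (hypothetical layout). */
struct pinfo {
    unsigned int nodeid;
    unsigned int pid;
    const char *group;
    struct pinfo *next;
};

/* Return 1 if an entry with the same node, pid and group is still listed,
 * i.e. the earlier leave has not been delivered yet; otherwise return 0. */
static int join_would_conflict(const struct pinfo *head, unsigned int nodeid,
                               unsigned int pid, const char *group)
{
    const struct pinfo *pi;

    assert(group != NULL);
    for (pi = head; pi != NULL; pi = pi->next) {
        if (pi->nodeid == nodeid && pi->pid == pid &&
            strcmp(pi->group, group) == 0)
            return 1;
    }
    return 0;
}
```

In this model, a join request is rejected only when all three of node id, pid and group name match an existing entry; the same pid joining a different group, or the same pid on another node, passes the check.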
diff --git a/trunk/services/cpg.c b/trunk/services/cpg.c
index 68bd1ed..94ada17 100644
--- a/trunk/services/cpg.c
+++ b/trunk/services/cpg.c
@@ -1067,6 +1067,21 @@ static void message_handler_req_lib_cpg_join (void *conn, const void *message)
}
}
+ /*
+ * Same check must be done in process info list, because there may be not yet delivered
+ * leave of client.
+ */
+ for (iter = process_info_list_head.next; iter != &process_info_list_head; iter = iter->next) {
+ struct process_info *pi = list_entry (iter, struct process_info, list);
+
+ if (pi->nodeid == api->totem_nodeid_get () && pi->pid == req_lib_cpg_join->pid &&
+ mar_name_compare(&req_lib_cpg_join->group_name, &pi->group) == 0) {
+ /* We have same pid and group name joined -> return error */
+ error = CPG_ERR_EXIST;
+ goto response_send;
+ }
+ }
+
switch (cpd->cpd_state) {
case CPD_STATE_UNJOINED:
error = CPG_OK;
--- /root/cpgtest.c 2010-02-25 09:45:43.000000000 +0100
+++ /root/test/a/cpgtest.c 2010-02-25 15:37:49.000000000 +0100
@@ -119,7 +119,7 @@
}
if (result != CS_OK) {
printf("cpg_join failed: %d\n", result);
- exit(-1);
+ goto retry;
}
fd_set read_fds;