On 6/13/2013 9:11 AM, Jan Friesse wrote:
> Fabio M. Di Nitto napsal(a):
>> On 6/13/2013 7:23 AM, Jan Friesse wrote:
>>> Fabio M. Di Nitto napsal(a):
>>>> On 6/12/2013 5:03 PM, Jan Friesse wrote:
>>>>> Signed-off-by: Jan Friesse <[email protected]>
>>>>> ---
>>>>> exec/votequorum.c | 3 +++
>>>>> 1 files changed, 3 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/exec/votequorum.c b/exec/votequorum.c
>>>>> index b561409..131b734 100644
>>>>> --- a/exec/votequorum.c
>>>>> +++ b/exec/votequorum.c
>>>>> @@ -1634,6 +1634,9 @@ static void
>>>>> message_handler_req_exec_votequorum_nodeinfo (
>>>>>
>>>>> if (nodeid == VOTEQUORUM_QDEVICE_NODEID) {
>>>>> struct cluster_node *sender_node =
>>>>> find_node_by_nodeid(sender_nodeid);
>>>>> +
>>>>> + assert(sender_node != NULL);
>>>>> +
>>>>> if ((!cluster_is_quorate) &&
>>>>> (sender_node->flags & NODE_FLAGS_QUORATE)) {
>>>>> node->votes = req_exec_quorum_nodeinfo->votes;
>>>>>
>>>>
>>>> I would prefer a proper error checking rather than assert here.
>>>>
>>>
>>> Agree. Actually, coverity found this problem, but I was ensure if it is
>>> not problem (this is what I was assuming because of dereference of
>>> sender_node) and then assert is proper solution OR it is problem and
>>> then this patch starts discussion about proper solution.
>>>
>>>> The logic is that a node always send its own information (to populate
>>>> node for node == sender_nodeid) and then send quorum device info from
>>>> that node.
>>>>
>>>> See static int votequorum_sync_process (void) that´s the one triggering
>>>> the population of the node entries.
>>>>
>>>> In theory, if sync queues the messages as expected, sender_node is never
>>>> NULL. If it is, then i´ll prefer at least to see a message that says:
>>>> "received qdevice info before sender node info\n" and then quit (unless
>>>> we can recover from that status easily).
>>>>
>>>
>>> Actually, because you've wrote that code you are maybe able to see how
>>> to recover easily, but when I was reading that code, it really didn't
>>> look so.
>>
>> Ok so the question turns into:
>>
>> static int votequorum_sync_process (void)
>> {
>> votequorum_exec_send_nodeinfo(us->node_id); <- message 1
>> votequorum_exec_send_nodeinfo(VOTEQUORUM_QDEVICE_NODEID); <- 2
>> if (strlen(qdevice_name)) {
>>
>> votequorum_exec_send_qdevice_reg(VOTEQUORUM_QDEVICE_OPERATION_REGISTER,
>> qdevice_name);
>> }
>> return 0;
>> }
>>
>> given this sequence of events, is there a possibility that message 1 can
>> be delivered after message 2?
>>
>
> Short answer is NO. Long answer is, if there is such possibility, it's
> ultra high priority bug and must be fixed.
>
>> If there is that possibility, then we need to review the whole
>> send_nodeinfo logic as there is the assumption (in more than one place
>> IIRC) that they arrive in sequence.
>>
>> If there is the guarantee that they are delivered in the same order,
>> then we can use that assert check to verify it and error out by
>> reporting deeper issues.
>
> There is a lot of informations logged at the beginning of
> message_handler_req_exec_votequorum_nodeinfo. Assert will cause sigabrt
> so blackbox is created with all that logged informations. -> From my
> point of view, assert is enough, because axiom 1 tells us that msg 1 is
> always before msg 2 and axiom 2 is msg 1 contains informations about
> send_node -> assert never appears and if it appears, it's bug in code.
Ok works for me if you think the info are enough, otherwise just switch to
if (sender_node == NULL) {
log some extra stuff
die hard
}
Fabio
_______________________________________________
discuss mailing list
[email protected]
http://lists.corosync.org/mailman/listinfo/discuss