HJ Lee identified a problem which is described in more detail below. A patch is attached to resolve it.
If the last message in the total order is not received by a processor, that processor is still active, and no new messages are originated for fail_to_recv_const (default = 50) token rotations, fail to recv will happen improperly. The reason is that the wrong information is used when determining the range of messages that should be checked for recovery, resulting in an "off by X" error, where X is the number of messages at the end of the total order that have not been received by the processor.

Example:

Processor A sends A=1 B=2 C=3
Processor B sends D=4 E=5 F=6
Processor C sends G=7 H=8 I=9

Processor A receives A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9)
Processor C receives A(1), B(2), C(3), D(4), E(5), F(6), G(7), H(8), I(9)
Processor B receives A(1), B(2), D(4), then has some transient fault in the kernel which allows it to receive udp packets but temporarily disrupts its multicast transmit

Processor B should request C, E, F, G, H, I to be added to the retransmit list.

In the current code and example, processor B has a high_seq_received (the highest sequence the processor has currently received) of 4 and a token->seq of 9. It uses high_seq_received (4) - my_aru (the "all received up to" value, which is 2). This gives a range of 2, which will request recovery of missing messages in 3-4. In this example, totem will only recover C(3) but not E-I (the messages at the end of the ordering).

Instead the retransmit list should have a range of 7 (token->seq - processor's my_aru). This will request retransmissions of messages that are missing on the local processor from 3-9.

With the current code, if no new messages are received within the fail_to_recv_const window to increase high_seq_received on the processor, fail to recv occurs.
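
To make the arithmetic in the example concrete, here is a minimal standalone sketch in C. It is not code from totemsrp.c: the helper names range_old, range_new and missing_in_range are hypothetical and exist only for this illustration, and the loop bound (i <= range) is an assumption inferred from the "range of 2 requests 3-4" description above.

```c
#include <assert.h>
#include <stddef.h>

/* Range as computed before the patch: highest sequence seen locally minus my_aru. */
static unsigned int range_old(unsigned int my_high_seq_received, unsigned int my_aru)
{
    assert(my_high_seq_received >= my_aru);
    return my_high_seq_received - my_aru;
}

/* Range as computed after the patch: sequence carried by the token minus my_aru. */
static unsigned int range_new(unsigned int token_seq, unsigned int my_aru)
{
    assert(token_seq >= my_aru);
    return token_seq - my_aru;
}

/*
 * Hypothetical helper: walk my_aru+1 .. my_aru+range and store every sequence
 * number that is absent from received[] into out[], up to out_max entries.
 * Returns the number of sequence numbers stored.
 */
static size_t missing_in_range(const unsigned int *received, size_t n_received,
                               unsigned int my_aru, unsigned int range,
                               unsigned int *out, size_t out_max)
{
    size_t count = 0;

    for (unsigned int i = 1; i <= range && count < out_max; i++) {
        unsigned int seq = my_aru + i;
        int found = 0;

        for (size_t j = 0; j < n_received; j++) {
            if (received[j] == seq) {
                found = 1;
                break;
            }
        }
        if (!found) {
            out[count++] = seq;
        }
    }
    return count;
}
```

For processor B in the example (received 1, 2 and 4, my_aru = 2, high_seq_received = 4, token->seq = 9), range_old(4, 2) is 2 and the walk finds only message 3, while range_new(9, 2) is 7 and the walk finds 3, 5, 6, 7, 8 and 9.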

Index: exec/totemsrp.c
===================================================================
--- exec/totemsrp.c	(revision 2684)
+++ exec/totemsrp.c	(working copy)
@@ -2475,7 +2475,7 @@
 	 * but only retry if there is room in the retransmit list
 	 */
 
-	range = instance->my_high_seq_received - instance->my_aru;
+	range = orf_token->seq - instance->my_aru;
 	assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
 
 	for (i = 1; (orf_token->rtr_list_entries < RETRANSMIT_ENTRIES_MAX) &&
