- **Comment**:

Bug Analysis :
--------------------
checkpointHandle is a pointer to the checkpoint handle, allocated in the
address space of the invoking process (CPA/application). The CPA stores in
this memory area the handle that the CPA/application/process uses to access
the checkpoint in subsequent invocations of the functions of the Checkpoint
Service API. In the case of saCkptCheckpointOpenAsync() and
saCkptCheckpointWrite(), this handle is returned in the corresponding
response message.

Even though saCkptCheckpointWrite() is a synchronous request, CPND uses
checkpointHandle to track the CPND<--->CPND messaging that is triggered by a
saCkptCheckpointWrite() request.

If the checkpoint is an SA_CKPT_CHECKPOINT_COLLOCATED & SA_CKPT_WR_ALL_REPLICAS
checkpoint, the checkpoint is opened on multiple nodes, and
saCkptCheckpointWrite() is being requested by multiple CPAs on the same node,
then the local CPND has to update all other CPNDs that have opened the same
checkpoint. To track the pending invocations/events, a temporary cpnd_evt
entry is added with checkpointHandle as its key. After a successful write
response has been received from all the CPNDs, the local CPND uses this
cpnd_evt entry to respond to the CPA/application with the result, and then
deletes the cpnd_evt entry from the local CPND (the entry is temporary and
exists only to track the response messages of peer CPNDs).
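
The tracking flow described above can be sketched roughly as follows. This is only an illustration and not the CPND code: the names `pending_table`, `pending_add` and `pending_on_peer_response`, the fixed table size, and the response counter are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 16 /* hypothetical table size */

/* Hypothetical temporary tracking entry, keyed by the checkpointHandle value. */
struct pending_write {
    int in_use;
    uint64_t key;         /* checkpointHandle value used as the key */
    unsigned outstanding; /* peer responses still expected */
};

struct pending_table {
    struct pending_write slots[MAX_PENDING];
};

/* Add a tracking entry before forwarding the write to the peers.
 * Returns 0 on success, -1 if peers is 0 or the table is full. */
static int pending_add(struct pending_table *t, uint64_t key, unsigned peers)
{
    if (peers == 0)
        return -1;
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].key = key;
            t->slots[i].outstanding = peers;
            return 0;
        }
    }
    return -1;
}

/* Record one peer response for the given key.
 * Returns 1 when all peers have answered (the entry is deleted and the
 * caller should now reply to the application), 0 while responses are still
 * outstanding, and -1 if no entry matches the key. */
static int pending_on_peer_response(struct pending_table *t, uint64_t key)
{
    for (size_t i = 0; i < MAX_PENDING; i++) {
        struct pending_write *p = &t->slots[i];
        if (p->in_use && p->key == key) {
            if (--p->outstanding == 0) {
                p->in_use = 0; /* temporary entry is no longer needed */
                return 1;
            }
            return 0;
        }
    }
    return -1;
}
```

In this sketch the lookup finds an entry by key alone, so two writers that present the same key cannot be told apart.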

In the current code, CPA uses the malloc() return value as checkpointHandle,
which also serves as the unique reference key. However, malloc() returns a
virtual address that is specific to each process, so malloc() can return the
same pointer value in separate CPA processes.
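
A minimal sketch of why a malloc() pointer is not unique across processes, assuming a POSIX system. The helper `child_malloc_address` is hypothetical and not part of OpenSAF: it forks a child, lets the child call malloc() once, and passes the resulting pointer value back to the parent through a pipe.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child, let it malloc() once, and report the
 * returned pointer value to the parent through a pipe.
 * Returns 0 on success and stores the child's pointer value in *out. */
static int child_malloc_address(size_t size, uintptr_t *out)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: allocate in its own address space and send the address. */
        close(fds[0]);
        uintptr_t addr = (uintptr_t)malloc(size);
        ssize_t n = write(fds[1], &addr, sizeof(addr));
        _exit(n == (ssize_t)sizeof(addr) ? 0 : 1);
    }

    /* Parent: read the address the child obtained, then reap the child. */
    close(fds[1]);
    ssize_t n = read(fds[0], out, sizeof(*out));
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof(*out))
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling the helper twice and comparing the two values typically shows the same address coming from two different processes, since each child starts from a copy of the same heap. That is allocator behaviour rather than a guarantee, but it is enough to make a pointer value unsuitable as a node-wide key.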

If two checkpoint applications (2 processes) try to write data to two
different checkpoints, or to the same checkpoint, it is possible that both
processes pass the same checkpointHandle value. Because both CPAs hand the
same checkpointHandle to CPND as the reference key, CPND can misbehave: the
key is ambiguous.


Solution :
----------------
As malloc() is a standard call and the Checkpoint Service API Specification
says checkpointHandle should be a pointer allocated in the address space of
the invoking process/CPA, CPND will return try again to the
saCkptCheckpointWrite() call if multiple saCkptCheckpointWrite() requests
with the SAME checkpointHandle (same virtual memory address) arrive at CPND
from different CPAs at the same time.
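
The proposed behaviour could look roughly like the sketch below. It is not the actual patch: `key_set`, `key_set_try_add`, `key_set_remove` and the `ADD_*` result codes are hypothetical names chosen for the example, and the real CPND would map the "try again" result to the corresponding AIS error returned to the application.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 16 /* hypothetical limit on concurrently pending writes */

/* Hypothetical result codes for this sketch only. */
enum add_result { ADD_OK = 0, ADD_TRY_AGAIN = 1, ADD_NO_SPACE = 2 };

/* Set of checkpointHandle values that currently have a write in progress. */
struct key_set {
    uint64_t keys[MAX_KEYS];
    int used[MAX_KEYS];
};

/* Register a write for the given key. If a write with the same key is
 * already pending, refuse it so that the caller can retry later. */
static enum add_result key_set_try_add(struct key_set *s, uint64_t key)
{
    int free_slot = -1;
    for (int i = 0; i < MAX_KEYS; i++) {
        if (s->used[i] && s->keys[i] == key)
            return ADD_TRY_AGAIN; /* same handle value already in flight */
        if (!s->used[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return ADD_NO_SPACE;
    s->used[free_slot] = 1;
    s->keys[free_slot] = key;
    return ADD_OK;
}

/* Remove the key once the write has completed and the reply was sent. */
static void key_set_remove(struct key_set *s, uint64_t key)
{
    for (int i = 0; i < MAX_KEYS; i++) {
        if (s->used[i] && s->keys[i] == key) {
            s->used[i] = 0;
            return;
        }
    }
}
```

With this approach the second writer is never tracked under an ambiguous key. It is simply asked to retry, and its retry succeeds once the first write has been answered and its key removed.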


Patch will be published soon.



---

** [tickets:#1467] cpav:two apps on same node simultaneously failed to write 
checkpoint and app hangs **

**Status:** accepted
**Milestone:** 4.5.2
**Created:** Mon Aug 31, 2015 04:05 AM UTC by A V Mahesh (AVM)
**Last Updated:** Mon Aug 31, 2015 04:05 AM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- [test_3opens_app_A.c](https://sourceforge.net/p/opensaf/tickets/1467/attachment/test_3opens_app_A.c) (9.2 kB; application/octet-stream)


Steps to reproduce  :
------------------------------------------------------------
Step -1 :

   - Add sleep(30) to the cpnd_evt_proc_nd2nd_ckpt_active_data_access_req()
     function in file `osaf/services/saf/cpsv/cpnd/cpnd_evt.c` at line 3037,
     then build, install and bring up OpenSAF on all 4 nodes
     (SC-1, SC-2, PL-3, PL-4).

   static uint32_t cpnd_evt_proc_nd2nd_ckpt_active_data_access_req(CPND_CB *cb,
           CPND_EVT *evt, CPSV_SEND_INFO *sinfo)
   {
       sleep(30);   /* added for reproduction */
       uint32_t rc = NCSCC_RC_SUCCESS;

     
Step -2 :
     Build the Cpsv applications `test_3opens_app_A.c` and
     `test_3opens_app_B.c` (attached to the SR) on all 4 nodes
     (SC-1, SC-2, PL-3, PL-4).

   SC-1:# gcc test_3opens_app_A.c -o node_A  -lSaCkpt;
   SC-1:# gcc test_3opens_app_B.c -o node_B  -lSaCkpt;
   

Step -3 : Bring up OpenSAF on all 4 nodes (SC-1, SC-2, PL-3, PL-4).

Step -4 : Run the checkpoint application ./node_A on all 4 nodes
          (SC-1, SC-2, PL-3, PL-4) and don't press the <Enter> key.

   SC-1:# ./node_A
   0 saCkptCheckpointOpen  returned checkpointHandle 626e60
   1 saCkptCheckpointOpen  returned checkpointHandle 626fe0
   2 saCkptCheckpointOpen  returned checkpointHandle 627270
   3 saCkptCheckpointOpen  returned checkpointHandle 6273f0
   4 saCkptCheckpointOpen  returned checkpointHandle 627570
   CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
   saCkptCheckpointWrite Press <Enter> key to continue...
   ====================================================

Step -5 : Run the checkpoint application ./node_B only on 2 nodes
          (SC-1 & SC-2) and don't press the <Enter> key.
   ====================================================
   SC-1:# ./node_B
   0 saCkptCheckpointOpen  returned checkpointHandle 626e60
   1 saCkptCheckpointOpen  returned checkpointHandle 626fe0
   2 saCkptCheckpointOpen  returned checkpointHandle 627270
   3 saCkptCheckpointOpen  returned checkpointHandle 6273f0
   4 saCkptCheckpointOpen  returned checkpointHandle 627570
   CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
   saCkptCheckpointWrite Press <Enter> key to continue...
   ====================================================

Step -6 : Press the <Enter> key for the ./node_A and ./node_B applications in
     quick succession, on SC-1 only, so that both write simultaneously.
     You will then see that the `node_B` checkpoint application fails to
     write the checkpoint:
   ====================================================
   SC-1: # ./node_A
   0 saCkptCheckpointOpen  returned checkpointHandle 626e60
   1 saCkptCheckpointOpen  returned checkpointHandle 626fe0
   2 saCkptCheckpointOpen  returned checkpointHandle 627270
   3 saCkptCheckpointOpen  returned checkpointHandle 6273f0
   4 saCkptCheckpointOpen  returned checkpointHandle 627570
   CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
   saCkptCheckpointWrite Press <Enter> key to continue...

   1 saCkptCheckpointWrite  checkpointHandle 626e60
   2 saCkptCheckpointWrite  checkpointHandle 626e60
   3 saCkptCheckpointWrite  checkpointHandle 626e60
   4 saCkptCheckpointWrite  checkpointHandle 626e60
   222 saCkptCheckpointWrite  checkpointHandle 626e60
   saCkptCheckpointRead Waiting to Read from Checkpoint ....
   saCkptCheckpointRead Press <Enter> key to continue...
 
   SC-1:# ./node_B
   0 saCkptCheckpointOpen  returned checkpointHandle 626e60
   1 saCkptCheckpointOpen  returned checkpointHandle 626fe0
   2 saCkptCheckpointOpen  returned checkpointHandle 627270
   3 saCkptCheckpointOpen  returned checkpointHandle 6273f0
   4 saCkptCheckpointOpen  returned checkpointHandle 627570
   CPSV:CPA:ONsaCkptCheckpointWrite Waiting to Read from Checkpoint ....
 
   ====================================================

