I analyzed the traces of osafckptd_1, osafckptd_2, osafckptnd_1 and osafckptnd_2.
1) I observed that opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou is a
non-collocated checkpoint, because:
In the file:
"osaf/services/saf/cpsv/cpnd/cpnd_evt.c"
cpnd_evt_proc_ckpt_finalize()
    TRACE_1("cpnd client ckpt close success ckpt_app_hdl:%llx,ckpt_id:%llx,ckpt_lcl_ref_cnt:%u",
            cl_node->ckpt_app_hdl, cp_node->ckpt_id, cp_node->ckpt_lcl_ref_cnt);
    /* Check for Non-Collocated Replica */
    if (m_CPND_IS_COLLOCATED_ATTR_SET(cp_node->create_attrib.creationFlags)) {
        rc = cpnd_ckpt_replica_close(cb, cp_node, &error);
        if (rc == NCSCC_RC_FAILURE) {
            TRACE_4("cpnd ckpt replica close failed ckpt_id:%llx", cp_node->ckpt_id);
            send_evt.info.cpa.info.finRsp.error = error;
            goto agent_rsp;
        }
a) In the trace, I see "T1 cpnd client ckpt close success
ckpt_app_hdl:20,ckpt_id:10,ckpt_lcl_ref_cnt:3" and "T1 cpnd client ckpt close
success ckpt_app_hdl:20,ckpt_id:10,ckpt_lcl_ref_cnt:2".
b) In cpnd_evt_proc_ckpt_finalize(), execution does not enter the branch
guarded by "if
(m_CPND_IS_COLLOCATED_ATTR_SET(cp_node->create_attrib.creationFlags))",
so this is a non-collocated checkpoint.
2) The finalize closed only two references to the checkpoint, leaving
ckpt_lcl_ref_cnt at 2:
Apr 15 13:57:41.492709 osafckptnd [18544:cpnd_evt.c:0489] >> cpnd_evt_proc_ckpt_finalize
Apr 15 13:57:41.492741 osafckptnd [18544:cpnd_res.c:1345] >> cpnd_restart_client_reset
Apr 15 13:57:41.492750 osafckptnd [18544:cpnd_res.c:0833] >> cpnd_find_exact_ckptinfo
Apr 15 13:57:41.492759 osafckptnd [18544:cpnd_res.c:0851] << cpnd_find_exact_ckptinfo
Apr 15 13:57:41.492767 osafckptnd [18544:cpnd_res.c:1370] << cpnd_restart_client_reset
Apr 15 13:57:41.492775 osafckptnd [18544:cpnd_proc.c:2206] >> cpnd_send_ckpt_usr_info_to_cpd
Apr 15 13:57:41.492783 osafckptnd [18544:cpnd_mds.c:1124] >> cpnd_mds_msg_send
Apr 15 13:57:41.492796 osafckptnd [18544:cpnd_mds.c:0212] >> cpnd_mds_callback
Apr 15 13:57:41.492808 osafckptnd [18544:cpnd_mds.c:0581] << cpnd_mds_enc_flat
Apr 15 13:57:41.492816 osafckptnd [18544:cpnd_mds.c:0254] << cpnd_mds_callback
Apr 15 13:57:41.493037 osafckptnd [18544:cpnd_mds.c:1161] << cpnd_mds_msg_send
Apr 15 13:57:41.493067 osafckptnd [18544:cpnd_proc.c:2230] << cpnd_send_ckpt_usr_info_to_cpd
Apr 15 13:57:41.493081 osafckptnd [18544:cpnd_evt.c:0544] T1 cpnd client ckpt close success ckpt_app_hdl:20,ckpt_id:10,ckpt_lcl_ref_cnt:3
Apr 15 13:57:41.493098 osafckptnd [18544:cpnd_res.c:1345] >> cpnd_restart_client_reset
Apr 15 13:57:41.493114 osafckptnd [18544:cpnd_res.c:0833] >> cpnd_find_exact_ckptinfo
Apr 15 13:57:41.493127 osafckptnd [18544:cpnd_res.c:0851] << cpnd_find_exact_ckptinfo
Apr 15 13:57:41.493144 osafckptnd [18544:cpnd_res.c:1370] << cpnd_restart_client_reset
Apr 15 13:57:41.493158 osafckptnd [18544:cpnd_proc.c:2206] >> cpnd_send_ckpt_usr_info_to_cpd
Apr 15 13:57:41.493173 osafckptnd [18544:cpnd_mds.c:1124] >> cpnd_mds_msg_send
Apr 15 13:57:41.493189 osafckptnd [18544:cpnd_mds.c:0212] >> cpnd_mds_callback
Apr 15 13:57:41.493200 osafckptnd [18544:cpnd_mds.c:0581] << cpnd_mds_enc_flat
Apr 15 13:57:41.493236 osafckptnd [18544:cpnd_mds.c:0254] << cpnd_mds_callback
Apr 15 13:57:41.493456 osafckptnd [18544:cpnd_mds.c:1161] << cpnd_mds_msg_send
Apr 15 13:57:41.493499 osafckptnd [18544:cpnd_proc.c:2230] << cpnd_send_ckpt_usr_info_to_cpd
Apr 15 13:57:41.493516 osafckptnd [18544:cpnd_evt.c:0544] T1 cpnd client ckpt close success ckpt_app_hdl:20,ckpt_id:10,ckpt_lcl_ref_cnt:2
Apr 15 13:57:41.493534 osafckptnd [18544:cpnd_evt.c:0558] T1 cpnd client finalize success for ckpt app hdl:20
Apr 15 13:57:41.493550 osafckptnd [18544:cpnd_res.c:1064] >> cpnd_restart_client_node_del
Apr 15 13:57:41.493566 osafckptnd [18544:cpnd_res.c:0708] T1 cpnd cli info write header success
Apr 15 13:57:41.493580 osafckptnd [18544:cpnd_res.c:1085] T1 cpnd ckpt info write success
Apr 15 13:57:41.493592 osafckptnd [18544:cpnd_res.c:1088] << cpnd_restart_client_node_del
Apr 15 13:57:41.493605 osafckptnd [18544:cpnd_mds.c:0987] >> cpnd_mds_send_rsp
Apr 15 13:57:41.493625 osafckptnd [18544:cpnd_mds.c:0212] >> cpnd_mds_callback
Apr 15 13:57:41.493643 osafckptnd [18544:cpnd_mds.c:0581] << cpnd_mds_enc_flat
Apr 15 13:57:41.493656 osafckptnd [18544:cpnd_mds.c:0254] << cpnd_mds_callback
Apr 15 13:57:41.493690 osafckptnd [18544:cpnd_mds.c:1008] << cpnd_mds_send_rsp
Apr 15 13:57:41.493707 osafckptnd [18544:cpnd_evt.c:0574] << cpnd_evt_proc_ckpt_finalize
Apr 15 13:5
So, ckpt_lcl_ref_cnt is still 2 after the finalize.
3) So the node director sends this information to the director as
CPSV_USR_INFO_CKPT_CLOSE, not CPSV_USR_INFO_CKPT_CLOSE_LAST, from the finalize
function.
4) So the checkpoint is not deleted and is still present in memory at SC-1.
5) At the Director of SC-1,
In the file:
osaf/services/saf/cpsv/cpd/cpd_evt.c
In this function:
cpd_evt_proc_ckpt_usr_info()
    if (evt->info.ckpt_usr_info.info_type == CPSV_USR_INFO_CKPT_CLOSE_LAST) {
        if (!(m_IS_SA_CKPT_CHECKPOINT_COLLOCATED(&ckpt_node->attributes))) {
            TRACE(" non-collocated CPSV_CKPT_RDSET_START ckpt_node->ckpt_id %llu ckpt_node->num_users %d ",
                  (SaUint64T)ckpt_node->ckpt_id, ckpt_node->num_users);
            if (ckpt_node->num_users == 1) {
                /* Clients for non-collocated Ckpt , Stop ret timer , broadcast to all CPNDs */
                memset(&send_evt, 0, sizeof(CPSV_EVT));
                send_evt.type = CPSV_EVT_TYPE_CPND;
                send_evt.info.cpnd.type = CPND_EVT_D2ND_CKPT_RDSET;
                send_evt.info.cpnd.info.rdset.ckpt_id = ckpt_node->ckpt_id;
                send_evt.info.cpnd.info.rdset.type = CPSV_CKPT_RDSET_START;
                rc = cpd_mds_bcast_send(cb, &send_evt, NCSMDS_SVC_ID_CPND);
                TRACE_2("cpd ckpt rdset success for ckpt_id:%llx,active_dest:%"PRIu64,
                        ckpt_node->ckpt_id, ckpt_node->active_dest);
            }
            if (m_CPND_IS_ON_SCXB(ckpt_node->ckpt_on_scxb1,
                                  cpd_get_slot_sub_id_from_mds_dest(sinfo->dest))) {
                if (evt->info.ckpt_usr_info.info_type == CPSV_USR_INFO_CKPT_CLOSE_LAST) {
                    ckpt_node->ckpt_on_scxb1 = 0;
                }
            }
            if (m_CPND_IS_ON_SCXB(ckpt_node->ckpt_on_scxb2,
                                  cpd_get_slot_sub_id_from_mds_dest(sinfo->dest))) {
                if (evt->info.ckpt_usr_info.info_type == CPSV_USR_INFO_CKPT_CLOSE_LAST) {
                    ckpt_node->ckpt_on_scxb2 = 0;
                }
            }
If the node director had sent CPSV_USR_INFO_CKPT_CLOSE_LAST, the director trace
would contain "non-collocated CPSV_CKPT_RDSET_START ckpt_node->ckpt_id".
This message does not appear in the director trace for ckpt_id 10.
This means that two references to this checkpoint still exist somewhere in the
70-node cluster.
Hence, the replica at SC-1 remains visible even though the application has
finalized it.
---
** [tickets:#1317] ckpt : stale replicas observed in a 70 node cluster**
**Status:** accepted
**Milestone:** 5.18.09
**Created:** Wed Apr 15, 2015 10:16 AM UTC by Sirisha Alla
**Last Updated:** Tue Sep 11, 2018 09:18 AM UTC
**Owner:** Mohan Kanakam
**Attachments:**
-
[logs.tar.bz2](https://sourceforge.net/p/opensaf/tickets/1317/attachment/logs.tar.bz2)
(6.5 MB; application/x-bzip)
This issue is observed on cs6377 (46FC Tag). The cluster has 70 nodes, and 2
checkpoint applications run on each node. The application running on the active
controller creates the checkpoint, while the applications running on the other
nodes open the same checkpoint and use it. After sections are created,
written and read, all the applications finalize the handles they used. The
retention duration of the checkpoint is set to a minimal value of 1000
nanoseconds.
Contents of /dev/shm on the active controller after the applications exited:
SLES-64BIT-SLOT1:~ # date;ls -lrt /dev/shm/
Wed Apr 15 14:25:09 IST 2015
total 1772
-rw-r--r-- 1 opensaf opensaf 1076040 Apr 15 13:38 opensaf_NCS_MQND_QUEUE_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf 328000 Apr 15 13:38 opensaf_NCS_GLND_RES_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf 160000 Apr 15 13:38 opensaf_NCS_GLND_LCK_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf 88000 Apr 15 13:38 opensaf_NCS_GLND_EVT_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf 704008 Apr 15 13:38 opensaf_CPND_CHECKPOINT_INFO_131343
-rw-r--r-- 1 opensaf opensaf 79848 Apr 15 13:55 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_4
-rw-r--r-- 1 opensaf opensaf 79848 Apr 15 13:56 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_9
-rw-r--r-- 1 opensaf opensaf 79848 Apr 15 13:57 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_16
SLES-64BIT-SLOT1:~ # date;immfind|grep -i ckpt
Wed Apr 15 14:25:11 IST 2015
safApp=safCkptService
SLES-64BIT-SLOT1:~ #
When an application tries to create a checkpoint with the same name again, the
checkpoint service does not create a new replica in shared memory.
The cpd and cpnd traces are attached.
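
For anyone reproducing this, the leftover replica files can also be checked programmatically instead of by reading the `ls` output. The following is a minimal sketch using only POSIX `opendir`/`readdir`; the function names are hypothetical, and the `opensaf_safCkpt=` prefix is taken from the /dev/shm listing above rather than from any documented naming rule.

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Prefix observed in the /dev/shm listing of this ticket (an assumption,
 * not a documented OpenSAF naming contract). */
#define REPLICA_PREFIX "opensaf_safCkpt="

/* True when a file name looks like a checkpoint replica file. */
bool is_replica_name(const char *name)
{
    assert(name != NULL);
    return strncmp(name, REPLICA_PREFIX, strlen(REPLICA_PREFIX)) == 0;
}

/* Print and count replica-like files in `dir`.
 * Returns the count, or -1 if the directory cannot be opened. */
int count_replica_files(const char *dir)
{
    assert(dir != NULL);
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (is_replica_name(entry->d_name)) {
            printf("replica file: %s/%s\n", dir, entry->d_name);
            count++;
        }
    }
    closedir(d);
    return count;
}
```

Calling `count_replica_files("/dev/shm")` on the active controller after all applications have exited and the retention duration has elapsed would be expected to return 0 if no replicas are left behind.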