The problem doesn't happen on the latest changeset.

According to the log, the problem happened because ckpt_lcl_ref_cnt was 
still 1 after the application on PL-4 finalized its handles:

Jul 17 17:18:07.958259 osafckptnd [13885:cpnd_evt.c:0540] T1 cpnd client ckpt close success ckpt_app_hdl:2,ckpt_id:1,ckpt_lcl_ref_cnt:1

Looking at the code, ckpt_lcl_ref_cnt is updated twice when cpnd 
restarts:
1. From the shared memory. According to the log, this update looks 
correct: 2 clients were added.

Jul 17 17:18:06.971105 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted
Jul 17 17:18:06.971122 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted

2. From the message CPND_EVT_A2ND_CKPT_REFCNTSET. The log contains little 
information about the message data.

Jul 17 17:18:06.978908 osafckptnd [13885:cpnd_evt.c:4202] >> cpnd_evt_proc_ckpt_refcntset
Jul 17 17:18:06.978921 osafckptnd [13885:cpnd_evt.c:4214] << cpnd_evt_proc_ckpt_refcntset

More information about this is needed for further investigation.

Suggestion: improve the tracing of ckpt_lcl_ref_cnt in cpnd to make this 
kind of problem easier to troubleshoot.
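
A rough sketch of what such a trace could carry is shown here: the checkpoint id, where the update came from, and the old and new values of ckpt_lcl_ref_cnt. Everything in it is hypothetical and for illustration only. The enum, the helper names, and the line format are assumptions, not existing cpnd code, and a real change would go through the OpenSAF trace macros rather than snprintf.

```c
#include <stdio.h>

/* Hypothetical origins of a ref-count update; these are NOT OpenSAF identifiers. */
typedef enum {
    REF_SRC_SHM_RESTORE,   /* count rebuilt from shared memory after a restart */
    REF_SRC_REFCNTSET_MSG, /* count changed by a CPND_EVT_A2ND_CKPT_REFCNTSET message */
    REF_SRC_CLIENT_OPEN,   /* a local client opened the checkpoint */
    REF_SRC_CLIENT_CLOSE   /* a local client closed the checkpoint */
} ref_src_t;

/* Map an update origin to a short tag for the trace line. */
static const char *ref_src_name(ref_src_t src)
{
    switch (src) {
    case REF_SRC_SHM_RESTORE:   return "shm_restore";
    case REF_SRC_REFCNTSET_MSG: return "refcntset_msg";
    case REF_SRC_CLIENT_OPEN:   return "client_open";
    case REF_SRC_CLIENT_CLOSE:  return "client_close";
    }
    return "unknown";
}

/* Format one trace line with the old and new count.
 * Returns what snprintf returns (the length the full line needs). */
static int format_ref_cnt_trace(char *buf, size_t len,
                                unsigned long long ckpt_id, ref_src_t src,
                                unsigned old_cnt, unsigned new_cnt)
{
    return snprintf(buf, len, "ckpt_id:%llu src:%s ckpt_lcl_ref_cnt:%u->%u",
                    ckpt_id, ref_src_name(src), old_cnt, new_cnt);
}
```

With a line like this emitted at every update, the two restart-time updates described above could be told apart directly in the log.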


---

** [tickets:#509] resource leak when cpnd restarts and the application finalizes 
the checkpoint handles**

**Status:** accepted
**Milestone:** future
**Created:** Thu Jul 18, 2013 06:14 AM UTC by Sirisha Alla
**Last Updated:** Fri Oct 02, 2015 09:15 AM UTC
**Owner:** Pham Hoang Nhat
**Attachments:**

- [logs.tar.gz](https://sourceforge.net/p/opensaf/tickets/509/attachment/logs.tar.gz) (78.0 kB; application/x-gzip)


The issue is observed on changeset 4325 on a 4-node cluster of SLES VMs.

The issue is reproducible with the following steps:

Checkpoint applications are running on PL-3 and PL-4.

1) On PL-3, an asynchronous collocated checkpoint is created and the same 
checkpoint is opened for writing on the same node
2) On PL-4, the checkpoint is opened twice with the write option
3) The active replica for the checkpoint is set on PL-3
4) A section is created in the checkpoint from PL-4
5) CPND is restarted on both payloads
6) The checkpoint is unlinked and closed on PL-3
7) The active replica is set on PL-4 and a section is created in the checkpoint
8) The checkpoint handles being used by the application on PL-4 are 
finalized

The replicas are expected to be deleted on both PL-3 and PL-4. The IMM 
database does not have any references to the checkpoint table or replica 
table, but a stale checkpoint is found in the shared memory on PL-4. The 
replica on PL-3 is deleted. However, this stale resource seems to have no 
functional impact: when a checkpoint with the same name is opened, a new 
replica is created in the shared memory.
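
The symptom can be pictured with a toy model in which a node-local replica is removed only once the checkpoint is unlinked and its local reference count has dropped to zero. This is a simplified sketch under that assumption. The struct and function below are hypothetical and are not the cpnd data structures.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a node-local replica (not the cpnd struct). */
typedef struct {
    unsigned ckpt_lcl_ref_cnt; /* local handles the node believes are open */
    bool unlinked;             /* the checkpoint has been unlinked */
    bool replica_present;      /* the shared-memory replica still exists */
} replica_model_t;

/* Close one handle. The replica is removed only when the checkpoint is
 * unlinked AND no local references remain (assumed cleanup rule). */
static void model_close(replica_model_t *r)
{
    if (r->ckpt_lcl_ref_cnt > 0)
        r->ckpt_lcl_ref_cnt--;
    if (r->unlinked && r->ckpt_lcl_ref_cnt == 0)
        r->replica_present = false;
}
```

In this model, if a restart leaves the count one higher than the number of handles that are really open, closing every real handle still leaves the count at 1 and the replica stays behind in shared memory.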

SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # immfind | grep -i ckpt
safApp=safCkptService
SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # ls -lrt /dev/shm/opensaf/
total 844
-rw-r--r-- 1 root root 132200 Jul 17 17:17 NCS_MQND_QUEUE_CKPT_INFO
-rw-r--r-- 1 root root 328000 Jul 17 17:17 NCS_GLND_RES_CKPT_INFO
-rw-r--r-- 1 root root 160000 Jul 17 17:17 NCS_GLND_LCK_CKPT_INFO
-rw-r--r-- 1 root root  88000 Jul 17 17:17 NCS_GLND_EVT_CKPT_INFO
-rw-r--r-- 1 root root   1688 Jul 17 17:18 safCkpt=collocated_ckpt_name_101_13_132111_1
-rw-r--r-- 1 root root 704008 Jul 17 17:18 CPND_CHECKPOINT_INFO_132111

The ckptd and ckptnd traces are attached. The test was run on 17 July at 17:17:59.

