The problem doesn't happen on the latest changeset.
According to the log, the problem occurred because ckpt_lcl_ref_cnt was
still 1 after the application on PL-4 finalized:
Jul 17 17:18:07.958259 osafckptnd [13885:cpnd_evt.c:0540] T1 cpnd client ckpt close success ckpt_app_hdl:2,ckpt_id:1,ckpt_lcl_ref_cnt:1
According to the code, ckpt_lcl_ref_cnt is updated twice when cpnd
restarts:
1. From the shared memory. According to the log, this update looks
correct: 2 clients were added.
Jul 17 17:18:06.971105 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted
Jul 17 17:18:06.971122 osafckptnd [13885:cpnd_res.c:0501] T1 cpnd client handle extracted
2. From the message CPND_EVT_A2ND_CKPT_REFCNTSET. There is not much information
about the message data in the log.
Jul 17 17:18:06.978908 osafckptnd [13885:cpnd_evt.c:4202] >> cpnd_evt_proc_ckpt_refcntset
Jul 17 17:18:06.978921 osafckptnd [13885:cpnd_evt.c:4214] << cpnd_evt_proc_ckpt_refcntset
More information about this is needed for further investigation.
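To make the two update paths above concrete, here is a minimal sketch in C. It is a toy model, not the cpnd code: the struct and the three functions are hypothetical, and only the field name ckpt_lcl_ref_cnt is taken from the log. Whether CPND_EVT_A2ND_CKPT_REFCNTSET adds to the count or overwrites it cannot be told from the trace, so the model supports both and either behaviour is an assumption.

```c
/* Hypothetical, simplified stand-in for a cpnd checkpoint node.
 * Only the field name ckpt_lcl_ref_cnt comes from the log. */
struct ckpt_node_model {
    unsigned ckpt_id;
    unsigned ckpt_lcl_ref_cnt;
};

/* Update 1: one reference per client handle recovered from shared memory. */
void restore_from_shm(struct ckpt_node_model *node, unsigned recovered_handles)
{
    node->ckpt_lcl_ref_cnt += recovered_handles;
}

/* Update 2 (assumed semantics): apply the count carried by the refcntset
 * message, either on top of the current value or as a replacement. */
void apply_refcntset(struct ckpt_node_model *node, unsigned msg_ref_cnt,
                     int additive)
{
    if (additive)
        node->ckpt_lcl_ref_cnt += msg_ref_cnt;
    else
        node->ckpt_lcl_ref_cnt = msg_ref_cnt;
}

/* Drop one reference on close. Returns 1 when the count reaches 0,
 * i.e. when the local replica could be removed, and 0 otherwise. */
int close_one(struct ckpt_node_model *node)
{
    if (node->ckpt_lcl_ref_cnt > 0)
        node->ckpt_lcl_ref_cnt--;
    return node->ckpt_lcl_ref_cnt == 0;
}
```

In this model, two handles restored from shared memory followed by an additive message carrying 1 leave the count at 1 after both closes, which is the value seen in the close trace. That is only one scenario consistent with the log, not a confirmed root cause.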
Suggestion: improve the tracing of ckpt_lcl_ref_cnt in cpnd to make this
kind of problem easier to troubleshoot.
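As a rough illustration of the tracing suggested here, the sketch below formats one line per change of the reference count, recording the old value, the new value, and the reason for the change. The helper name format_refcnt_trace and the message layout are hypothetical; in cpnd the resulting string would go through the daemon's existing trace facility rather than a local buffer.

```c
#include <stdio.h>

/* Hypothetical helper: build a trace line for one change of the local
 * reference count. Returns the snprintf() result, i.e. the number of
 * characters the full line needs, or a negative value on error. */
int format_refcnt_trace(char *buf, size_t len, const char *reason,
                        unsigned long ckpt_id, unsigned old_cnt,
                        unsigned new_cnt)
{
    return snprintf(buf, len,
                    "ckpt_id:%lu ckpt_lcl_ref_cnt:%u->%u reason:%s",
                    ckpt_id, old_cnt, new_cnt, reason);
}
```

Calling the helper at every site that changes the count (restore from shared memory, refcntset message, open, close) would show in the log which step left the extra reference.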
---
** [tickets:#509] resource leak when cpnd restarts and the application finalizes
the checkpoint handles**
**Status:** accepted
**Milestone:** future
**Created:** Thu Jul 18, 2013 06:14 AM UTC by Sirisha Alla
**Last Updated:** Fri Oct 02, 2015 09:15 AM UTC
**Owner:** Pham Hoang Nhat
**Attachments:**
- [logs.tar.gz](https://sourceforge.net/p/opensaf/tickets/509/attachment/logs.tar.gz) (78.0 kB; application/x-gzip)
The issue is observed on changeset 4325 on a 4-node SLES cluster of VMs.
The issue is reproducible with the following steps, with checkpoint
applications running on PL-3 and PL-4:
1) On PL-3, an asynchronous collocated checkpoint is created and the same
checkpoint is opened for writing on the same node
2) On PL-4 the checkpoint is opened twice with write option
3) Active replica for the checkpoint is set on PL-3
4) A section is created in the checkpoint from PL-4
5) CPND is restarted on both the payloads
6) Checkpoint is unlinked and closed on PL-3
7) Active replica is set on PL-4 and a section is created in the checkpoint
8) The checkpoint handles being used by the application on PL-4 are
finalized.
The replicas are expected to be deleted on both PL-3 and PL-4. The IMM
database does not have any references to the checkpoint table or replica
table, but a stale checkpoint is found in the shared memory on PL-4. The
replica on PL-3 is deleted. However, this stale resource seems to have no
functional impact: when a checkpoint with the same name is opened, a new
replica is created in the shared memory.
SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # immfind | grep -i ckpt
safApp=safCkptService
SLES-64BIT-SLOT4:/opt/goahead/tetware/opensaffire/bin64 # ls -lrt /dev/shm/opensaf/
total 844
-rw-r--r-- 1 root root 132200 Jul 17 17:17 NCS_MQND_QUEUE_CKPT_INFO
-rw-r--r-- 1 root root 328000 Jul 17 17:17 NCS_GLND_RES_CKPT_INFO
-rw-r--r-- 1 root root 160000 Jul 17 17:17 NCS_GLND_LCK_CKPT_INFO
-rw-r--r-- 1 root root 88000 Jul 17 17:17 NCS_GLND_EVT_CKPT_INFO
-rw-r--r-- 1 root root 1688 Jul 17 17:18 safCkpt=collocated_ckpt_name_101_13_132111_1
-rw-r--r-- 1 root root 704008 Jul 17 17:18 CPND_CHECKPOINT_INFO_132111
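The manual check in the listing above could be automated in a test with a small helper like the following sketch. It assumes, based only on the single file seen in this listing, that replica files in the shared memory directory have names starting with "safCkpt="; the function names are hypothetical and the directory path is passed in by the caller.

```c
#include <dirent.h>
#include <string.h>

/* Assumption: replica files are named "safCkpt=<checkpoint name>...",
 * as in the one example in the listing. */
#define CKPT_REPLICA_PREFIX "safCkpt="

/* Returns 1 if the file name looks like a checkpoint replica, else 0. */
int is_ckpt_replica_name(const char *name)
{
    return strncmp(name, CKPT_REPLICA_PREFIX,
                   strlen(CKPT_REPLICA_PREFIX)) == 0;
}

/* Counts replica-like files in dir_path.
 * Returns -1 if the directory cannot be opened. */
int count_replica_files(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL) {
        if (is_ckpt_replica_name(entry->d_name))
            count++;
    }
    closedir(dir);
    return count;
}
```

After step 8, a test could call count_replica_files("/dev/shm/opensaf/") on PL-4 and expect 0 once the leak is fixed.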
ckptd and ckptnd traces are attached. The test was run on 17 July at 17:17:59.