Hi Sergio,

We were not able to reproduce the issue on version 5.22.01 using the steps 
you shared.

Could you please send us the immd, immnd, ckptd, ckptnd, syslog and mds.log 
files from all the nodes of the cluster?

 

Thanks & Regards

Mohan Kanakam | 91-8333082448

Senior Software Engineer

High Availability Solutions

 www.GetHighAvailability.com

Get High Availability Today !

NJ, USA: 1 508-507-6507    |    Hyderabad, India: 91 798-992-5293

 

 

From: Mohan Kanakam [mailto:mo...@gethighavailability.com] 
Sent: 28 February 2022 20:55
To: 'Sérgio Marques'
Cc: 'Nagendra Kumar'
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding 
to async call.

 

Hi Sergio,

Thanks for the information.

We will try to reproduce and get back to you.

 

Thanks & Regards

Mohan Kanakam | 91-8333082448


 

From: Sérgio Marques [mailto:sergio-l-marq...@alticelabs.com] 
Sent: 28 February 2022 20:46
To: Mohan Kanakam
Cc: 'Nagendra Kumar'
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding 
to async call.

 

Hi Mohan,

 

I believe I have finally found a way for you to reproduce the problem.

Please try the following steps:

1.      Start 2 controllers, with SC-2 Active and SC-1 Standby, and 2 payloads, 
PL-1 and PL-2.
2.      On PL-2, create a checkpoint, create a section in it, and write to the 
section.
3.      On PL-1, create exactly the same checkpoint as on PL-2 and try to 
create the same section again. You will receive SA_AIS_ERR_EXIST. Then do a 
SectionOverwrite.
4.      On SC-1, perform an si-swap and then reboot the SC-1 and PL-1 nodes:
amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF
5.      Wait for SC-1 and PL-1 to rejoin the cluster.
6.      On SC-2, perform an si-swap with the same command 
(amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF) and then list the 
checkpoint using immlist.
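
The application-side logic of steps 2 and 3 (try to create the section, and fall back to an overwrite when it already exists) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `demo_*` names, the error enum and the in-memory section are hypothetical stand-ins and are not the SA Forum checkpoint API or OpenSAF code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome codes; stand-ins, not the SA Forum definitions. */
enum demo_err { DEMO_OK, DEMO_ERR_EXIST, DEMO_ERR_NOT_EXIST };

#define DEMO_MAX_SECTION_SIZE 64 /* arbitrary size chosen for the sketch */

/* Hypothetical in-memory stand-in for one checkpoint section. */
struct demo_section {
    int exists;
    size_t size;
    unsigned char data[DEMO_MAX_SECTION_SIZE];
};

/* Creates the section; fails with DEMO_ERR_EXIST if it is already there. */
enum demo_err demo_section_create(struct demo_section *s,
                                  const void *buf, size_t len)
{
    assert(s != NULL && buf != NULL && len <= DEMO_MAX_SECTION_SIZE);
    if (s->exists)
        return DEMO_ERR_EXIST;
    memcpy(s->data, buf, len);
    s->size = len;
    s->exists = 1;
    return DEMO_OK;
}

/* Replaces the content of an existing section. */
enum demo_err demo_section_overwrite(struct demo_section *s,
                                     const void *buf, size_t len)
{
    assert(s != NULL && buf != NULL && len <= DEMO_MAX_SECTION_SIZE);
    if (!s->exists)
        return DEMO_ERR_NOT_EXIST;
    memcpy(s->data, buf, len);
    s->size = len;
    return DEMO_OK;
}

/* Step 3: try to create; if the section already exists, overwrite it. */
enum demo_err demo_create_or_overwrite(struct demo_section *s,
                                       const void *buf, size_t len)
{
    enum demo_err rc = demo_section_create(s, buf, len);
    if (rc == DEMO_ERR_EXIST)
        rc = demo_section_overwrite(s, buf, len);
    return rc;
}
```

In a real test application the two stand-in functions would be replaced by the corresponding checkpoint service calls, with the first node creating the section and the second node hitting the "already exists" branch.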

 

Thanks and regards,

Sérgio Marques

 

From: Mohan Kanakam <mo...@gethighavailability.com> 
Sent: 18 February 2022 13:40
To: Sérgio Marques <sergio-l-marq...@alticelabs.com>
Cc: 'Nagendra Kumar' <nagen...@gethighavailability.com>; 
opensaf-users@lists.sourceforge.net
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding 
to async call.

 

Warning: This email originated outside Altice Portugal. Please do not click 
links or open attachments unless you know the sender and know that the 
content is safe.

 

Hi Sergio,

Thanks for testing and sharing the results.

We tried to reproduce the issue in our lab setup, but unfortunately we were 
not able to.

These are the steps we followed:

1.      Start 2 controllers, with SC-1 Active and SC-2 Standby, and the PL-3 
payload.
2.      Create checkpoints from applications running on the payload.
3.      Reboot SC-1 (Active). SC-2 becomes Active and SC-1 rejoins as Standby.
4.      Perform an si-swap. SC-2 becomes Standby and SC-1 becomes Active.
5.      Reboot SC-1 again.
6.      While it is rebooting, run immlist on the checkpoints created. Here we 
got the expected immlist output.

Can you please confirm whether this is the way to reproduce it?

Did the issue persist after the rebooted controller rejoined the cluster, or 
did immlist work again once it had joined?

I was thinking that this could be a transient issue.

Can you please share the immd, immnd, amfd, amfnd, ckptd, ckptnd, mds.log and 
syslog files from all the nodes?

 

 Thanks & Regards

Mohan Kanakam | 91-8333082448


 

From: Sérgio Marques [mailto:sergio-l-marq...@alticelabs.com] 
Sent: 17 February 2022 22:53
To: mo...@gethighavailability.com
Subject: RE: [opensaf:tickets] #3306 ckpt: checkpoint node director responding 
to async call.

 

Hi Mohan,

 

I made a small change to your patch so that it compiles.

Where you have “sinfo->ctxt->length” I changed it to “sinfo->ctxt.length”.

It resolves the problem: “MDS_SND_RCV: Invalid Sync CTXT Len” events are no 
longer logged in mds.log. Thanks!
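
For readers following the compile fix: `->` dereferences a pointer, so `sinfo->ctxt->length` only compiles if `ctxt` is itself a pointer. That `sinfo->ctxt.length` compiles suggests `ctxt` is a structure embedded by value. The sketch below shows the distinction with hypothetical stand-in types (`demo_sync_ctxt`, `demo_send_info`); their names and fields are assumptions for illustration, not the actual MDS definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real MDS structures are not reproduced here. */
struct demo_sync_ctxt {
    uint8_t length;   /* number of valid bytes in data[] */
    uint8_t data[12];
};

struct demo_send_info {
    struct demo_sync_ctxt ctxt; /* embedded by value, not a pointer */
};

/* Returns the context length, given a pointer to the outer structure. */
uint8_t demo_ctxt_length(const struct demo_send_info *sinfo)
{
    assert(sinfo != NULL);
    /* sinfo is a pointer, so '->' reaches ctxt; ctxt is a plain member,
     * so '.' reaches length. Writing sinfo->ctxt->length here would be
     * rejected by the compiler because ctxt is not a pointer. */
    return sinfo->ctxt.length;
}
```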

Unfortunately, it does not resolve another issue that we have and were hoping 
this patch would also fix.

We set up a cluster with 2 controller and 2 payload nodes, then create a 
checkpoint like the following one:

 

[root@OLT2T4-UNICOM-2~]# immlist safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
Name                                               Type         Value(s)
========================================================================
safCkpt                                            SA_STRING_T  safCkpt=CKPT_BACKPLANE_CONTROL
saCkptCheckpointUsedSize                           SA_UINT64_T  2024 (0x7e8)
saCkptCheckpointSize                               SA_UINT64_T  2024 (0x7e8)
saCkptCheckpointRetDuration                        SA_TIME_T    9223372036854775807 (0x7fffffffffffffff, Sat Jan 27 10:50:44 1990)
saCkptCheckpointNumWriters                         SA_UINT32_T  7 (0x7)
saCkptCheckpointNumSections                        SA_UINT32_T  22 (0x16)
saCkptCheckpointNumReplicas                        SA_UINT32_T  2 (0x2)
saCkptCheckpointNumReaders                         SA_UINT32_T  7 (0x7)
saCkptCheckpointNumOpeners                         SA_UINT32_T  7 (0x7)
saCkptCheckpointNumCorruptSections                 SA_UINT32_T  0 (0x0)
saCkptCheckpointMaxSections                        SA_UINT32_T  22 (0x16)
saCkptCheckpointMaxSectionSize                     SA_UINT64_T  92 (0x5c)
saCkptCheckpointMaxSectionIdSize                   SA_UINT64_T  1 (0x1)
saCkptCheckpointCreationTimestamp                  SA_TIME_T    1645097377000000000 (0x16d48f5929030a00, Thu Feb 17 11:29:37 2022)
saCkptCheckpointCreationFlags                      SA_UINT32_T  2 (0x2)
SaImmAttrImplementerName                           SA_STRING_T  safCheckPointService
SaImmAttrClassName                                 SA_STRING_T  SaCkptCheckpoint
SaImmAttrAdminOwnerName                            SA_STRING_T  <Empty>

 

After swapping (amf-adm -t 10 si-swap safSi=SC-2N,safApp=OpenSAF) and rebooting 
the active controller node for the second time, immlist starts returning 
SA_AIS_ERR_NO_RESOURCES:

 

[root@OLT2T4-UNICOM-2~]# immlist safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
error - saImmOmAccessorGet_2 FAILED: SA_AIS_ERR_NO_RESOURCES (18)

 

The checkpoint can be found using immfind, but it can neither be listed with 
immlist nor accessed via the libSaCkpt.so library.

 

[root@OLT2T4-UNICOM-2~]# immfind safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safReplica=safNode=CC-1\,safCluster=myClmCluster,safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService
safReplica=safNode=CC-2\,safCluster=myClmCluster,safCkpt=CKPT_BACKPLANE_CONTROL,safApp=safCkptService

 

If I only perform the swap command, without the reboot, this issue is not 
reproduced.

I don’t have this issue with the 4.5.2 OpenSAF version.

Do you have any idea what could cause this and how we should debug it?

 

Many thanks and regards,

Sérgio Marques

 

 

From: Mohan Kanakam <mohan-has...@users.sourceforge.net> 
Sent: 17 February 2022 10:51
To: [opensaf:tickets] <3...@tickets.opensaf.p.re.sourceforge.net>
Subject: [opensaf:tickets] #3306 ckpt: checkpoint node director responding to 
async call.

 


 

Hi Sergio,
Can you please test the attached patch for your scenario and share your 
observations?
Thanks

Attachments:

*       mds_error.patch 
<https://sourceforge.net/p/opensaf/tickets/_discuss/thread/04984c7ecf/8052/attachment/mds_error.patch>
  (703 Bytes; application/octet-stream) 

  _____  

[tickets:#3306] <https://sourceforge.net/p/opensaf/tickets/3306/>  ckpt: 
checkpoint node director responding to async call.

Status: accepted
Milestone: 5.22.04
Created: Thu Feb 17, 2022 10:46 AM UTC by Mohan Kanakam
Last Updated: Thu Feb 17, 2022 10:46 AM UTC
Owner: Mohan Kanakam

During section creation, one ckptnd sends an async request (a normal MDS send) 
to another ckptnd. However, the receiving ckptnd responds to the request on 
the assumption that it received a sync request and must reply to the sender 
ckptnd. In some cases a sync request arrives at ckptnd and a response is 
required, but in other cases the request is async and no response is needed.
We are getting the following message in mds.log when creating the section:
sc1-VirtualBox osafckptnd 27692 mds.log [meta sequenceId="2"] MDS_SND_RCV: Invalid Sync CTXT Len
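
The behaviour described above amounts to replying only when the sender is actually waiting. A minimal sketch of that decision follows, assuming (as the `sinfo->ctxt.length` check discussed in this thread suggests, but which is not confirmed here) that a sync request carries a non-empty sync context while an async request leaves its length at zero. All `demo_*` names and types are hypothetical and are not the ckptnd or MDS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the send info attached to a received request. */
struct demo_sync_ctxt {
    uint8_t length;   /* assumed: 0 for async, non-zero for sync requests */
    uint8_t data[12];
};

struct demo_send_info {
    struct demo_sync_ctxt ctxt;
};

/* Assumption: only a request with a non-empty context expects a reply. */
bool demo_needs_response(const struct demo_send_info *sinfo)
{
    assert(sinfo != NULL);
    return sinfo->ctxt.length != 0;
}

/* Handles one section-create request; returns true if a reply was sent.
 * replies_sent is a counter standing in for the real reply call. */
bool demo_handle_section_create(const struct demo_send_info *sinfo,
                                int *replies_sent)
{
    assert(replies_sent != NULL);
    /* ... the request itself would be processed here ... */
    if (!demo_needs_response(sinfo))
        return false;   /* async request: the sender is not waiting */
    (*replies_sent)++;  /* sync request: reply to the sender */
    return true;
}
```

With a guard of this kind, an async request is processed without any reply being attempted, which is the case the ticket associates with the "Invalid Sync CTXT Len" message.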
