Hi Sirisha,

This is not related to #1059. Please open a new ticket on CLM. The analysis follows.


Analysis:

1. saImmOiRtObjectUpdate fails and CLM tries to re-initialize with IMM
Sep 11 14:58:21.425773 osafclmd [9606:imma_oi_api.c:2279] >> saImmOiRtObjectUpdate_2
Sep 11 14:58:24.709408 osafclmd [9606:imma_oi_api.c:2554] << saImmOiRtObjectUpdate_2
Sep 11 14:58:24.709606 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=PL-4,safCluster=myClmCluster failed with rc = 20. Reinit with IMM
Sep 11 14:58:24.709700 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.711075 osafclmd [9606:imma_proc.c:0147] >> imma_callback_ipc_destroy
Sep 11 14:58:24.711160 osafclmd [9606:imma_proc.c:0206] TR Deleting client node
Sep 11 14:58:24.711213 osafclmd [9606:imma_init.c:0323] >> imma_shutdown: use count 1
Sep 11 14:58:24.711247 osafclmd [9606:imma_init.c:0208] >> imma_destroy
Sep 11 14:58:24.711605 osafclmd [9606:imma_db.c:1318] >> imma_db_destroy
Sep 11 14:58:24.711643 osafclmd [9606:imma_db.c:0129] >> imma_client_tree_destroy
Sep 11 14:58:24.711672 osafclmd [9606:imma_db.c:0152] >> imma_client_tree_cleanup
Sep 11 14:58:24.711701 osafclmd [9606:imma_db.c:0167] << imma_client_tree_cleanup
Sep 11 14:58:24.711732 osafclmd [9606:imma_db.c:0135] << imma_client_tree_destroy
Sep 11 14:58:24.711762 osafclmd [9606:imma_db.c:1327] << imma_db_destroy
Sep 11 14:58:24.711795 osafclmd [9606:imma_init.c:0334] << imma_shutdown: use count 0
Sep 11 14:58:24.711827 osafclmd [9606:imma_oi_api.c:0790] << saImmOiFinalize
Sep 11 14:58:24.711860 osafclmd [9606:clms_imm.c:2319] >> clm_imm_reinit_bg
Sep 11 14:58:24.711952 osafclmd [9606:clms_imm.c:2326] << clm_imm_reinit_bg
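
The handling visible in these traces is roughly the following; this is a paraphrase with assumed names and structure, not the actual clms_imm.c code:

#include <saImmOi.h>

/* Hypothetical stand-in for the CLMS control block that owns the
 * shared OI handle (cb->immOiHandle in the conclusion below). */
struct clms_cb {
        SaImmOiHandleT immOiHandle;
};

/* Spawns clm_imm_reinit_thread in the background (see item 4). */
extern void clm_imm_reinit_bg(struct clms_cb *cb);

/* On an error from saImmOiRtObjectUpdate_2, the current code drops
 * the handle and kicks off a background re-initialization. */
static void handle_rt_update_failure(struct clms_cb *cb)
{
        saImmOiFinalize(cb->immOiHandle);  /* ">> saImmOiFinalize" above */
        cb->immOiHandle = 0;               /* assumed: clear stale handle */
        clm_imm_reinit_bg(cb);             /* ">> clm_imm_reinit_bg" above */
}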

2. saImmOiRtObjectUpdate fails again and CLM tries to re-initialize with IMM
Sep 11 14:58:24.715452 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=SC-1,safCluster=myClmCluster failed with rc = 9. Reinit with IMM
Sep 11 14:58:24.715480 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.715505 osafclmd [9606:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.715530 osafclmd [9606:clms_imm.c:2319] >> clm_imm_reinit_bg
Sep 11 14:58:24.715638 osafclmd [9606:clms_imm.c:2326] << clm_imm_reinit_bg

3. saImmOiRtObjectUpdate fails a third time and CLM tries to re-initialize with IMM
Sep 11 14:58:24.719538 osafclmd [9606:imma_oi_api.c:2275] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.719557 osafclmd [9606:clms_imm.c:0488] IN saImmOiRtObjectUpdate for safNode=SC-2,safCluster=myClmCluster failed with rc = 9. Reinit with IMM
Sep 11 14:58:24.719571 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.719584 osafclmd [9606:imma_oi_api.c:0626] T2 ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.719599 osafclmd [9606:clms_imm.c:2319] >> clm_imm_reinit_bg
Sep 11 14:58:24.719690 osafclmd [9606:clms_imm.c:2326] << clm_imm_reinit_bg

4. CLM calls saImmOiInitialize three times

Sep 11 14:58:24.718336 osafclmd [9606:clms_imm.c:2272] >> clm_imm_reinit_thread
Sep 11 14:58:24.720346 osafclmd [9606:imma_init.c:0263] >> imma_startup: use count 0
Sep 11 14:58:24.727340 osafclmd [9606:imma_mds.c:0421] T3 IMMND UP
Sep 11 14:58:24.727378 osafclmd [9606:osaf_secutil.c:0270] >> osaf_auth_server_connect
Sep 11 14:58:24.727730 osafclmd [9606:osaf_secutil.c:0322] << osaf_auth_server_connect
Sep 11 14:58:24.727750 osafclmd [9606:mds_main.c:0249] TR received type:78, status:0
Sep 11 14:58:24.727765 osafclmd [9606:imma_proc.c:0419] >> imma_determine_clients_to_resurrect
Sep 11 14:58:24.727781 osafclmd [9606:imma_proc.c:0501] << imma_determine_clients_to_resurrect
Sep 11 14:58:24.727808 osafclmd [9606:imma_init.c:0063] >> imma_sync_with_immnd
Sep 11 14:58:24.727829 osafclmd [9606:imma_init.c:0173] TR Client agent successfully initialized
Sep 11 14:58:24.727841 osafclmd [9606:imma_init.c:0286] << imma_startup: use count 1
Sep 11 14:58:24.727860 osafclmd [9606:imma_oi_api.c:0122] T2 OI client version A.2.1
Sep 11 14:58:24.729020 osafclmd [9606:imma_oi_api.c:0250] T1 Trying to add OI client id:324 node:2020f handle:1440002020f
Sep 11 14:58:24.729384 osafclmd [9606:imma_oi_api.c:0341] << saImmOiInitialize_2
Sep 11 14:58:24.729456 osafclmd [9606:imma_oi_api.c:0364] >> saImmOiSelectionObjectGet
Sep 11 14:58:24.729482 osafclmd [9606:imma_oi_api.c:0440] << saImmOiSelectionObjectGet


Sep 11 14:58:24.734381 osafimmnd [9565:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_IMM_OI_INIT (4) from 2020f
Sep 11 14:58:24.734549 osafimmnd [9565:ImmModel.cc:3465] >> accessControlMode
Sep 11 14:58:24.734595 osafimmnd [9565:ImmModel.cc:3482] << accessControlMode: 0
Sep 11 14:58:24.734619 osafimmnd [9565:immnd_evt.c:0818] T2 Added client with id: 1450002020f <node:2020f, count:325>
Sep 11 14:58:24.735833 osafimmnd [9565:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2020f
Sep 11 14:58:24.735921 osafimmnd [9565:immnd_evt.c:2429] TR Messages pending:2
Sep 11 14:58:24.735937 osafimmnd [9565:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_IMPLSET_REQ to 0
Sep 11 14:58:24.738250 osafimmnd [9565:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_IMM_OI_INIT (4) from 2020f
Sep 11 14:58:24.738290 osafimmnd [9565:ImmModel.cc:3465] >> accessControlMode
Sep 11 14:58:24.738313 osafimmnd [9565:ImmModel.cc:3482] << accessControlMode: 0
Sep 11 14:58:24.738329 osafimmnd [9565:immnd_evt.c:0818] T2 Added client with id: 1460002020f <node:2020f, count:326>


5. CLM calls saImmOiImplementerSet; the implementer-set from the first OI initialize (conn 324) is the one that reaches IMMND

Sep 11 14:58:24.735833 osafimmnd [9565:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2020f
Sep 11 14:58:24.735921 osafimmnd [9565:immnd_evt.c:2429] TR Messages pending:2
Sep 11 14:58:24.735937 osafimmnd [9565:immsv_evt.c:5363] T8 Sending: IMMD_EVT_ND2D_IMPLSET_REQ to 0

Sep 11 14:58:24.743689 osafimmnd [9565:immnd_evt.c:8502] >> immnd_evt_proc_fevs_rcv
Sep 11 14:58:24.743729 osafimmnd [9565:immnd_evt.c:8518] T2 FEVS from myself, still pending:2
Sep 11 14:58:24.743750 osafimmnd [9565:immsv_evt.c:5382] T8 Received: IMMND_EVT_D2ND_IMPLSET_RSP (60) from 0
Sep 11 14:58:24.743765 osafimmnd [9565:immnd_evt.c:9138] T2 originated here?:1 nodeId:2020f conn: 324
Sep 11 14:58:24.743776 osafimmnd [9565:ImmModel.cc:12116] >> implementerSet
Sep 11 14:58:24.743797 osafimmnd [9565:ImmModel.cc:12157] T7 Re-using implementer for safClmService
Sep 11 14:58:24.743842 osafimmnd [9565:ImmModel.cc:12238] NO Implementer connected: 30 (safClmService) <324, 2020f>
Sep 11 14:58:24.743883 osafimmnd [9565:ImmModel.cc:12302] << implementerSet

6. CLM calls saImmOiClassImplementerSet and it fails with ERR_BAD_HANDLE

Sep 11 14:58:24.744013 osafclmd [9606:imma_oi_api.c:1636] T2 ERR_BAD_HANDLE: No implementer is set for this handle
Sep 11 14:58:24.744055 osafclmd [9606:clms_imm.c:2291] ER saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting


Conclusion:

Each time saImmOiRtObjectUpdate fails, CLM tries to re-initialize with IMM. In this case it re-initializes three times, and a separate thread is created for each re-init. The OI handle is shared among the three threads: because clm_imm_reinit_thread in CLMD stores the handle in the shared variable cb->immOiHandle, the handle ends up belonging to the latest OI initialize (the one logged as "Added client with id: 1460002020f <node:2020f, count:326>"). After the implementer-set succeeds in the first thread, CLM calls saImmOiClassImplementerSet with the OI handle of the latest thread (thread 3), for which no implementer is set, so ERR_BAD_HANDLE is returned.
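
Reconstructed from the traces, the interleaving is roughly:

Thread 1: saImmOiInitialize_2 -> client 324; cb->immOiHandle = handle 1
Thread 2: saImmOiInitialize_2 -> client 325; cb->immOiHandle = handle 2
Thread 3: saImmOiInitialize_2 -> client 326; cb->immOiHandle = handle 3
Thread 1: saImmOiImplementerSet succeeds for client 324 (safClmService)
Thread 1: saImmOiClassImplementerSet(cb->immOiHandle) now uses handle 3,
          which has no implementer set, so ERR_BAD_HANDLE is returned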

Before retrying another RT object update, CLM has to wait until the IMM re-init has succeeded (instead of spawning multiple threads).
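
A minimal sketch of that serialization, only to illustrate the "at most one re-init thread" idea; all names below are hypothetical, not existing OpenSAF code:

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t imm_reinit_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool imm_reinit_in_progress = false;

static void *reinit_thread_fn(void *arg)
{
        (void)arg;
        /* saImmOiInitialize_2 + saImmOiImplementerSet +
         * saImmOiClassImplementerSet would run here; only on success
         * would the pending RT updates be retried. */
        pthread_mutex_lock(&imm_reinit_mutex);
        imm_reinit_in_progress = false;  /* re-init done, retries may resume */
        pthread_mutex_unlock(&imm_reinit_mutex);
        return NULL;
}

/* Called where a failed saImmOiRtObjectUpdate triggers a background
 * re-init today; spawns at most one re-init thread. */
void clm_imm_reinit_bg_once(void)
{
        pthread_t tid;

        pthread_mutex_lock(&imm_reinit_mutex);
        if (imm_reinit_in_progress) {
                /* A re-init is already running: do not spawn another
                 * thread, and keep RT updates queued until it is done. */
                pthread_mutex_unlock(&imm_reinit_mutex);
                return;
        }
        imm_reinit_in_progress = true;
        pthread_mutex_unlock(&imm_reinit_mutex);

        if (pthread_create(&tid, NULL, reinit_thread_fn, NULL) != 0) {
                pthread_mutex_lock(&imm_reinit_mutex);
                imm_reinit_in_progress = false;
                pthread_mutex_unlock(&imm_reinit_mutex);
        } else {
                pthread_detach(tid);
        }
}

With something like this in place, clms_retry_pending_rtupdates would keep answering "try later" until the single re-init thread has finished.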

Thanks and regards,
Neel.

On Thursday 11 September 2014 03:15 PM, Sirisha Alla wrote:

Neel,

I have seen another issue after a switchover. This time CLM failed with
BAD_HANDLE on classImplSet(). Is this issue also resolved by #946 or
should we open a new ticket for it?

Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally
disconnected. Marking it as doomed 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmpbed: IN Delaying class delete
at slave PBE due to ongoing commit of ccb:15/21
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer
disconnected 26 <5, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer
connected: 30 (safClmService) <324, 2020f>
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafclmd[9606]: ER
saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: NO
'safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'avaDown' : Recovery is 'nodeFailfast'
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: ER
safComp=CLM,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:avaDown
Recovery is:nodeFailfast
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafamfnd[9639]: Rebooting OpenSAF
NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node
failfast, OwnNodeId = 131599, SupervisionTime = 60
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer locally
disconnected. Marking it as doomed 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went
down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: WA IMMND - Client went
down so no response
Sep 11 14:58:24 SLES-64BIT-SLOT2 osafimmnd[9565]: NO Implementer
disconnected 30 <324, 2020f> (safClmService)
Sep 11 14:58:24 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60

CLMD traces on that node:

Sep 11 14:58:24.715267 osafclmd [9606:clms_imm.c:0530] TR Implementer
not yet set: Switching on the tryagain flag
Sep 11 14:58:24.715296 osafclmd [9606:clms_imm.c:0533] <<
clms_node_update_rattr
Sep 11 14:58:24.715319 osafclmd [9606:clms_imm.c:0449] >>
clms_admin_state_update_rattr: Admin state 1 update for node
safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715343 osafclmd [9606:clms_util.c:0036] >>
clms_node_get_by_name: name input safNode=SC-1,safCluster=myClmCluster
length 36
Sep 11 14:58:24.715369 osafclmd [9606:clms_util.c:0044] TR nodename
after patricia tree get safNode=SC-1,safCluster=myClmCluster
Sep 11 14:58:24.715393 osafclmd [9606:clms_util.c:0047] <<
clms_node_get_by_name
Sep 11 14:58:24.715421 osafclmd [9606:imma_oi_api.c:2275] T2
ERR_BAD_HANDLE: No initialized handle exists!
Sep 11 14:58:24.715452 osafclmd [9606:clms_imm.c:0488] IN
saImmOiRtObjectUpdate for safNode=SC-1,safCluster=myClmCluster failed
with rc = 9. Reinit with IMM
Sep 11 14:58:24.715480 osafclmd [9606:imma_oi_api.c:0622] >> saImmOiFinalize
Sep 11 14:58:24.715505 osafclmd [9606:imma_oi_api.c:0626] T2
ERR_BAD_HANDLE: No initialized handle exists!

....

Sep 11 14:58:24.719788 osafclmd [9606:clms_imm.c:0721] TR Implementer is
not set. Switching on flag in clms_cluster_update_rattr
Sep 11 14:58:24.719801 osafclmd [9606:clms_imm.c:0724] <<
clms_cluster_update_rattr
Sep 11 14:58:24.719815 osafclmd [9606:clms_imm.c:0633] <<
clms_retry_pending_rtupdates
Sep 11 14:58:24.719828 osafclmd [9606:clms_main.c:0435] TR There is an
IMM task to be tried again. setting poll time out to 500

.....
Sep 11 14:58:24.740335 osafclmd [9606:clms_imm.c:0620] <<
clms_retry_pending_rtupdates: Implementerset yet to happen, try later
Sep 11 14:58:24.740345 osafclmd [9606:clms_main.c:0435] TR There is an
IMM task to be tried again. setting poll time out to 500
Sep 11 14:58:24.744013 osafclmd [9606:imma_oi_api.c:1636] T2
ERR_BAD_HANDLE: No implementer is set for this handle
Sep 11 14:58:24.744055 osafclmd [9606:clms_imm.c:2291] ER
saImmOiClassImplementerSet failed for class SaClmNode rc:9, exiting

Regards,
Sirisha

On Wednesday 10 September 2014 09:28 PM, Neelakanta Reddy wrote:

    A. SLOT1 node went down:

     1. CLM got BAD_HANDLE and finalizes the handle

    Sep 10 14:56:51.543332 osafclmd [7511:imma_oi_api.c:0622] >>
    saImmOiFinalize
    Sep 10 14:56:51.543370 osafclmd [7511:imma_oi_api.c:0626] T2
    ERR_BAD_HANDLE: No initialized handle exists!

     2. Discard implementer is called

    Sep 10 14:56:51.538179 osafimmnd [7448:immsv_evt.c:5363] T8
    Sending: IMMD_EVT_ND2D_DISCARD_IMPL to 0
    Sep 10 14:56:51.539878 osafimmnd [7448:ImmModel.cc:11474] >>
    discardImplementer
    Sep 10 14:56:51.539994 osafimmnd [7448:ImmModel.cc:11510] NO
    Implementer locally disconnected. Marking it as doomed 190 <17,
    2010f> (safClmService)
    Sep 10 14:56:51.540181 osafimmnd [7448:ImmModel.cc:11534] <<
    discardImplementer

     3. But the implementer actually got disconnected at

    Sep 10 14:56:51.580449 osafimmnd [7448:immnd_evt.c:8588] T2 Global
    discard implementer for id:190
    Sep 10 14:56:51.580462 osafimmnd [7448:ImmModel.cc:11474] >>
    discardImplementer
    Sep 10 14:56:51.580496 osafimmnd [7448:ImmModel.cc:11481] NO
    Implementer disconnected 190 <17, 2010f> (safClmService)
    Sep 10 14:56:51.580518 osafimmnd [7448:ImmModel.cc:11534] <<
    discardImplementer

     4. CLM tries to re-initialize and receives ERR_EXIST

    Sep 10 14:56:51.551900 osafclmd [7511:imma_oi_api.c:0440] <<
    saImmOiSelectionObjectGet
    Sep 10 14:56:51.551942 osafclmd [7511:clms_imm.c:2286] ER
    saImmOiImplementerSet failed rc:14, exiting
    Sep 10 14:59:51.245982 osafclmd [2538:clms_main.c:0267] >> clms_init

    Sep 10 14:56:51.548981 osafimmnd [7448:immsv_evt.c:5382] T8
    Received: IMMND_EVT_A2ND_OI_IMPL_SET (40) from 2010f
    Sep 10 14:56:51.549023 osafimmnd [7448:immnd_evt.c:2471] T2
    SENDRSP FAIL 14


      #946 fixes the above problem in CLM

    B. SLOT2 node went down (Quiesced --> Active)

    1.

        Sep 10 14:56:57.681152 osafamfd [6896:role.cc:0375] NO
        FAILOVER Quiesced --> Active

    2.

        saImmOiRtObjectUpdate_2 got BAD_HANDLE so AMFD tries to
        re-initialize with IMM and calls avd_imm_reinit_bg

    Sep 10 14:56:57.701333 osafamfd [6896:imma_oi_api.c:2279] >>
    saImmOiRtObjectUpdate_2
    Sep 10 14:56:57.701344 osafamfd [6896:imma_oi_api.c:2345] T2
    ERR_BAD_HANDLE: The SaImmOiHandleT is not associated with any
    implementer name
    Sep 10 14:56:57.701353 osafamfd [6896:imma_oi_api.c:2554] <<
    saImmOiRtObjectUpdate_2
    Sep 10 14:56:57.701362 osafamfd [6896:imm.cc:0164] TR BADHANDLE
    Sep 10 14:56:57.701370 osafamfd [6896:imm.cc:1660] >>
    avd_imm_reinit_bg
    Sep 10 14:56:57.701406 osafamfd [6896:imm.cc:1662] NO
    Re-initializing with IMM
    Sep 10 14:56:57.701420 osafamfd [6896:imma_oi_api.c:0622] >>
    saImmOiFinalize

     3. Before the finalize has completed clearing the OI
        handle, impl_set is called by AMFD in the function
        avd_role_failover_qsd_actv (calling
        avd_imm_impl_set_task_create). Because of this, AMFD exited.

    Sep 10 14:56:57.701178 osafamfd [6896:role.cc:0498] <<
    avd_role_failover_qsd_actv

    Sep 10 14:56:57.702256 osafamfd [6896:imm.cc:1215] >> avd_imm_impl_set
    Sep 10 14:56:57.702273 osafamfd [6896:imma_oi_api.c:1281] T4
    ERR_LIBRARY: Overlapping use of IMM OI handle by multiple threads
    Sep 10 14:56:57.703683 osafamfd [6896:imm.cc:1218] ER
    saImmOiImplementerSet failed 2
    Sep 10 14:56:57.703788 osafamfd [6896:imm.cc:1288] ER exiting
    since avd_imm_impl_set failed

    Because the OI handle is shared across multiple threads in AMFD,
    saImmOiImplementerSet failed with ERR_LIBRARY.

    ------------------------------------------------------------------------

    *[tickets:#1059] <http://sourceforge.net/p/opensaf/tickets/1059>
    2PBE: cluster reset observed during switchovers*

    *Status:* unassigned
    *Milestone:* 4.3.3
    *Created:* Wed Sep 10, 2014 09:57 AM UTC by Sirisha Alla
    *Last Updated:* Wed Sep 10, 2014 10:29 AM UTC
    *Owner:* nobody

    The issue is seen on SLES X86. OpenSAF is running with changeset
    5697 with 2PBE with 50k application objects.

    Switchovers with IMM application running is in progress when the
    issue is observed.

    Syslog on SC-1:

    Sep 10 14:56:47 SLES-64BIT-SLOT1 osafamfnd[7540]: NO Assigned
    'safSi=SC-2N,safApp=OpenSAF' ACTIVE to
    'safSu=SC-1,safSg=2N,safApp=OpenSAF'
    Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    disconnected 182 <0, 2020f> (@OpenSafImmReplicatorB)
    Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting
    distributed PBE commit for PRTA update Ccb:100000063/4294967395
    Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied
    with OK on attempt to start prepare of ccb:100000063/4294967395
    Sep 10 14:56:47 SLES-64BIT-SLOT1 osafimmpbed: IN Starting
    distributed PBE commit for Ccb:6/6
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    (applier) connected: 193 (@OpenSafImmReplicatorB) <0, 2020f>
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for ccb:6/6
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Slave PBE replied
    with OK on attempt to start prepare of ccb:6/6
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: IN Starting
    distributed PBE commit for PRTA update Ccb:100000064/4294967396
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Ccb 6
    COMMITTED (SetUp_Ccb)
    Sep 10 14:56:48 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:49 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:50 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO Slave PBE 1 or
    Immsv (6) replied with transient error on prepare for
    ccb:100000064/4294967396
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA Start prepare for
    ccb: 100000064/4294967396 towards slave PBE returned: '6' from
    sttandby PBE
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: WA PBE-A failed to
    prepare PRTA update Ccb:100000064/4294967396 towards PBE-B
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmpbed: NO 2PBE Error (20)
    in PRTA update (ccbId:100000064)
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: WA update of
    PERSISTENT runtime attributes in object
    'safNode=PL-4,safCluster=myClmCluster' REVERTED. PBE rc:20
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    locally disconnected. Marking it as doomed 190 <17, 2010f>
    (safClmService)
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafclmd[7511]: ER
    saImmOiImplementerSet failed rc:14, exiting
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: NO
    'safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
    'avaDown' : Recovery is 'nodeFailfast'
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: ER
    safComp=CLM,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
    to:avaDown Recovery is:nodeFailfast
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafamfnd[7540]: Rebooting
    OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted:
    recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
    Sep 10 14:56:51 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local
    node; timeout=60
    Sep 10 14:56:51 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    disconnected 190 <17, 2010f> (safClmService)
    Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: ER PBE PRTAttrs
    Update continuation missing! invoc:100
    Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    disconnected 16 <0, 2020f> (@OpenSafImmPBE)
    Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: NO Implementer
    disconnected 17 <0, 2020f> (OsafImmPbeRt_B)
    Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmnd[7448]: WA Timeout on
    syncronous admin operation 108
    Sep 10 14:56:53 SLES-64BIT-SLOT1 osafimmpbed: WA Failed to delete
    class towards slave PBE. Library or immsv replied Rc:5 - ignoring

    Syslog on SC-2

    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER Failed to stop
    cluster tracking 5
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ClmTrack stop
    failed
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Current role:
    ACTIVE
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: Rebooting OpenSAF
    NodeId = 131343 EE Name = , Reason: Received Node Down for peer
    controller, OwnNodeId = 131599, SupervisionTime = 60
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: WA IMMND DOWN on
    active controller f1 detected at standby immd!! f2. Possible failover
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set
    to QUIESCED
    .....
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osaffmd[6816]: NO Controller
    Failover: Setting role to ACTIVE
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafrded[6807]: NO RDE role set
    to ACTIVE
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ACTIVE request
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osaflogd[6850]: NO ACTIVE request
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfd[6863]: NO ACTIVE request
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafclmd[6877]: NO ACTIVE request
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO ellect_coord
    invoke from rda_callback ACTIVE
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmd[6826]: NO New coord
    elected, resides at 2020f
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO 2PBE
    configured, IMMSV_PBE_FILE_SUFFIX:.2020f (sync)
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO This IMMND is
    now the NEW Coord
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Coord
    broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:18
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO Node 'SC-1'
    left the cluster
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: NO Assigning
    'safSi=SC-2N,safApp=OpenSAF' ACTIVE to
    'safSu=SC-2,safSg=2N,safApp=OpenSAF'
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7701]: NO exiting on
    signal 15
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA Global PURGE
    PERSISTENT RTO mutations received in epoch 18
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: WA PBE failed in
    persistification of class delete fWiTRoVwpQDAfWNBVtqJ
    ......
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO FAILOVER
    Quiesced --> Active
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER ncs_mbcsv_svc
    NCS_MBCSV_OP_CHG_ROLE 1 failed
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer
    connected: 200 (MsgQueueService131343) <410, 2020f>
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer
    locally disconnected. Marking it as doomed 200 <410, 2020f>
    (MsgQueueService131343)
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer
    (applier) connected: 201 (@OpenSafImmReplicatorA) <412, 2020f>
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafntfimcnd[7722]: NO Started
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer
    disconnected 184 <10, 2020f> (safAmfService)
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: NO
    Re-initializing with IMM
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafimmnd[6836]: NO Implementer
    disconnected 200 <410, 2020f> (MsgQueueService131343)
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER
    saImmOiImplementerSet failed 2
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfd[6896]: ER exiting since
    avd_imm_impl_set failed
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: ER AMF director
    unexpectedly crashed
    Sep 10 14:56:57 SLES-64BIT-SLOT2 osafamfnd[6906]: Rebooting
    OpenSAF NodeId = 131599 EE Name = , Reason: local AVD down(Adest)
    or both AVD down(Vdest) received, OwnNodeId = 131599,
    SupervisionTime = 60
    Sep 10 14:56:57 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local
    node; timeout=60

    Syslog and traces are attached.

