Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V3 [#1621]

A V Mahesh Thu, 10 Mar 2016 02:12:14 -0800

Hi Nhat Pham,

ACK from me.


Tested & reviewed the following:
-----------------------------------------
1) Application running on payloads and controllers WITHOUT headless
state occurrences

2) Application running on payloads and controllers WITH headless state
occurrences

3) Application running on payloads and controllers with CPND restart & WITH
headless state occurrences

4) In-service upgrade while application running on payload WITH
headless state occurrences
   (Active controller old without #1621 patch, Standby controller
new with #1621 patch, and 2 payloads old without #1621 patch)

5) In-service upgrade while application running on Standby/Active
new/old controllers WITH headless state occurrences
   (Active controller old without #1621 patch, Standby controller
new with #1621 patch, and 2 payloads old without #1621 patch)

No major issues observed, except some `saImmOiRtObjectCreate_2 failed with
error = 14` messages:

=================================================================================
Mar 10 15:22:57 SC-1 osafamfd[15219]: NO Node 'PL-3' joined the cluster
Mar 10 15:23:12 SC-1 osafckptd[15382]: ER saImmOiRtObjectCreate_2 failed with error = 14
Mar 10 15:23:12 SC-1 osafckptd[15382]: ER create runtime ckpt object failed with error: 14
=================================================================================
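
For reference when reading the log above: in the SaAisErrorT enumeration of the SA Forum AIS specification, the value 14 is SA_AIS_ERR_EXIST. One possible reading (not verified against this log) is that the runtime object was still present in IMM when CPD tried to create it again. Below is a minimal C sketch for decoding such codes. The helper name ais_error_name is hypothetical, and the enum is a local copy of a few values rather than the real saAis.h header.

```c
/* Local copy of a few SaAisErrorT values from the SA Forum AIS
 * specification. A real program would include saAis.h instead. */
typedef enum {
    AIS_OK = 1,
    AIS_ERR_TRY_AGAIN = 6,
    AIS_ERR_BAD_HANDLE = 9,
    AIS_ERR_NOT_EXIST = 12,
    AIS_ERR_EXIST = 14
} ais_error_t;

/* Hypothetical helper: map a numeric AIS return code, as printed in
 * syslog, to its symbolic name. Codes not listed here yield "UNKNOWN". */
const char *ais_error_name(int code)
{
    switch (code) {
    case AIS_OK:             return "SA_AIS_OK";
    case AIS_ERR_TRY_AGAIN:  return "SA_AIS_ERR_TRY_AGAIN";
    case AIS_ERR_BAD_HANDLE: return "SA_AIS_ERR_BAD_HANDLE";
    case AIS_ERR_NOT_EXIST:  return "SA_AIS_ERR_NOT_EXIST";
    case AIS_ERR_EXIST:      return "SA_AIS_ERR_EXIST";
    default:                 return "UNKNOWN";
    }
}
```

Calling ais_error_name(14) gives the name for the code seen in the osafckptd messages.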

Not yet tested  :
-----------------------------------------

1)  Ckpt Data content restoration after headless state  occurrences

Note: I observed an issue with dataSize in the ioVector. I will continue
this testing, and any issue found in CKPT data restore can be fixed
afterwards.

=================================================================================
520|0 304 13304835 1 68| Verifying IO Vector returned in the CheckpointTrackCallback
520|0 304 13304835 1 69| Size of the data written 13 and the dataSize in the ioVector 46 doesnot match
=================================================================================

General note:
---------------------
Please add a README in the CPSV folder.

-AVM
>
>
> On 3/3/2016 3:41 PM, Nhat Pham wrote:
>> Hi Mahesh,
>>
>> Have you reviewed the patch?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 29, 2016 1:15 PM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
>> preserving and recovering checkpoint replicas during headless state 
>> V3 [#1621]
>>
>> Hi Nhat Pham,
>>
>> I will review the V3 patch, do the final functional testing, and get
>> back to you soon.
>> (It may take some time, as I also need to work on my published MDS
>> enhancements.)
>>
>> -AVM
>>
>>
>> On 2/29/2016 9:39 AM, Nhat Pham wrote:
>>> Hi,
>>>
>>> Following is a summary of the updates in V3:
>>>
>>> Comment 1: This functionality should be enabled only if the Hydra
>>> configuration is enabled in IMM (attrName =
>>> const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>>>
>>> Status: Included in V3
>>>
>>> Comment 2: Keep the scope of the CPSV service such that non-collocated
>>> checkpoint creation returns NOT_SUPPORTED if the cluster is running with
>>> IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration is enabled at
>>> cluster startup and is currently not configurable, so there is no
>>> chance of a run-time configuration change).
>>>
>>> Status: No change in code. CPSV still supports non-collocated
>>> checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>>>
>>> Comment 3: This is about case where checkpoint node director (cpnd)
>>> crashes during headless state. In this case the cpnd can't finish
>>> starting because it can't initialize CLM service.
>>> Then after time out, the AMF triggers a restart again. Finally, the
>>> node is rebooted.
>>> It is expected that this problem should not lead to a node reboot.
>>>
>>> Status: Included in V3. CPND reinitializes CLM service if the fault
>>> TRY_AGAIN is returned.
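
The retry behaviour described in Comment 3 can be sketched as a bounded loop that repeats an initialization call while it returns TRY_AGAIN. This is only an illustration under stated assumptions, not the code from the patch: init_with_retry, fake_clm_init, the callback types, and the RC_* constants are all hypothetical, and the initializer is passed in as a function pointer instead of calling the real CLM API.

```c
#include <stddef.h>

/* Local stand-ins for SA_AIS_OK (1) and SA_AIS_ERR_TRY_AGAIN (6). */
enum { RC_OK = 1, RC_TRY_AGAIN = 6 };

typedef int (*init_fn)(void *ctx);  /* hypothetical initializer callback */
typedef void (*wait_fn)(void);      /* hypothetical back-off hook */

/* Call init() up to max_attempts times, retrying only on RC_TRY_AGAIN.
 * Returns the last return code seen; any other code ends the loop at once. */
int init_with_retry(init_fn init, void *ctx, int max_attempts, wait_fn wait)
{
    int rc = RC_TRY_AGAIN;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = init(ctx);
        if (rc != RC_TRY_AGAIN)
            return rc;
        /* Back off before the next attempt, but not after the last one. */
        if (wait != NULL && attempt + 1 < max_attempts)
            wait();
    }
    return rc;
}

/* Stub initializer for demonstration: reports RC_TRY_AGAIN until the
 * counter pointed to by ctx reaches zero, then reports RC_OK. */
int fake_clm_init(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

Bounding the number of attempts keeps the daemon from spinning forever, while the optional wait hook leaves the back-off policy to the caller.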
>>>
>>> Comment 4: The suggestion was to re-create the checkpoint without any
>>> sections in case all replicas are lost. If the sections were
>>> re-created, the application wouldn't know that data has been lost. I
>>> think the BAD_HANDLE approach is okay since we have used it in other
>>> services, but I see it as kind of a hack solution that is not really
>>> in line with the specs.
>>> The specs never intended BAD_HANDLE to be something that can happen
>>> spontaneously on a previously valid handle, unless you are suffering
>>> from memory corruption. In the future we could consider the
>>> feasibility of avoiding spontaneous BAD_HANDLE where possible, and in
>>> CKPT I think it may be possible by re-creating the checkpoints.
>>>
>>> Status: NOT included in V3.
>>> This change is quite large and requires a detailed design covering
>>> different scenarios. I would suggest creating an enhancement ticket
>>> for this. What do you think?
>>>
>>> Best regards,
>>> Nhat Pham
>>>
>>> -----Original Message-----
>>> From: Nhat Pham [mailto:[email protected]]
>>> Sent: Monday, February 29, 2016 11:06 AM
>>> To: [email protected]; [email protected]
>>> Cc: [email protected]
>>> Subject: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
>>> preserving and recovering checkpoint replicas during headless state V3
>>> [#1621]
>>>
>>> Summary: cpsv: Support preserving and recovering checkpoint replicas
>>> during headless state V3 [#1621]
>>> Review request for Trac Ticket(s): 1621
>>> Peer Reviewer(s): [email protected]; [email protected]
>>> Pull request to: [email protected]
>>> Affected branch(es): default
>>> Development branch: default
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>    Docs                    n
>>>    Build system            n
>>>    RPM/packaging           n
>>>    Configuration files     n
>>>    Startup scripts         n
>>>    SAF services            y
>>>    OpenSAF services        n
>>>    Core libraries          n
>>>    Samples                 n
>>>    Tests                   n
>>>    Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>
>>> changeset 8559fe4cea27efc8234f7cf779f3c7413efcd40f
>>> Author:    Nhat Pham <[email protected]>
>>> Date:    Mon, 29 Feb 2016 11:02:15 +0700
>>>
>>>     cpsv: Support preserving and recovering checkpoint replicas during
>>>     headless state V3 [#1621]
>>>
>>>     Background:
>>>     ----------
>>>     This enhancement preserves checkpoint replicas when both SCs are
>>>     down (headless state) and recovers replicas when one of the SCs comes
>>>     up again. If both SCs go down, checkpoint replicas on surviving nodes
>>>     still remain. When an SC is available again, surviving replicas are
>>>     automatically registered in the SC checkpoint database. Content in
>>>     surviving replicas is kept intact and synchronized to new replicas.
>>>
>>>     When no SC is available, client API calls that change checkpoint
>>>     configuration, which requires SC communication, are rejected. Client
>>>     API calls reading and writing existing checkpoint replicas still work.
>>>
>>>     Limitation: The CKPT service does not support recovering checkpoints
>>>     in the following cases:
>>>      - The checkpoint was unlinked before the headless state.
>>>      - The non-collocated checkpoint has its active replica located on
>>>        an SC.
>>>      - The non-collocated checkpoint has its active replica located on
>>>        a PL and this PL restarts during the headless state.
>>>     In these cases, the checkpoint replica is destroyed. The fault code
>>>     SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>>>     checkpoint in these cases. The client must re-open the checkpoint.
>>>
>>>     While in headless state, accessing checkpoint replicas does not work
>>>     if the node which hosts the active replica goes down. It will work
>>>     again when an SC is available again.
>>>
>>>     Solution:
>>>     ---------
>>>     The solution for this enhancement includes 2 parts:
>>>
>>>     1. To destroy the un-recoverable checkpoints described above when
>>>     both SCs are down: When both SCs are down, the CPND deletes
>>>     un-recoverable checkpoint nodes and replicas on PLs. Then it requests
>>>     the CPA to destroy the corresponding checkpoint node by using the new
>>>     message CPA_EVT_ND2A_CKPT_DESTROY.
>>>
>>>     2. To update the CPD with checkpoint information: When an active SC
>>>     is up after headless, the CPND updates the CPD with checkpoint
>>>     information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE
>>>     instead of CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>>>     CPD_EVT_ND2D_CKPT_CREATE were used, the CPND would create a new
>>>     ckpt_id for the checkpoint, which might differ from the current
>>>     ckpt_id. The CPD collects checkpoint information within 6s. During
>>>     this update window, the following requests are rejected with fault
>>>     code SA_AIS_ERR_TRY_AGAIN:
>>>     - CPD_EVT_ND2D_CKPT_CREATE
>>>     - CPD_EVT_ND2D_CKPT_UNLINK
>>>     - CPD_EVT_ND2D_ACTIVE_SET
>>>     - CPD_EVT_ND2D_CKPT_RDSET
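
The rejection rule for the update window can be sketched as a small admission check. This is a hedged illustration rather than the CPD code from the patch. The enum, the function names, and the boolean `updating` flag (assumed to be set while the 6s collection timer is running) are hypothetical. Only the four rejected request types and the TRY_AGAIN result come from the description above.

```c
#include <stdbool.h>

/* Local stand-ins for SA_AIS_OK (1) and SA_AIS_ERR_TRY_AGAIN (6). */
enum { GATE_OK = 1, GATE_TRY_AGAIN = 6 };

/* Hypothetical event identifiers mirroring the message names above. */
typedef enum {
    EVT_CKPT_CREATE,
    EVT_CKPT_UNLINK,
    EVT_ACTIVE_SET,
    EVT_CKPT_RDSET,
    EVT_CKPT_INFO_UPDATE
} cpd_evt_t;

/* True for the four request types that are rejected during the window. */
bool evt_blocked_during_update(cpd_evt_t evt)
{
    switch (evt) {
    case EVT_CKPT_CREATE:
    case EVT_CKPT_UNLINK:
    case EVT_ACTIVE_SET:
    case EVT_CKPT_RDSET:
        return true;
    default:
        return false;
    }
}

/* Decide whether a request may proceed. `updating` is assumed to be true
 * while checkpoint information is still being collected after headless. */
int admit_event(cpd_evt_t evt, bool updating)
{
    if (updating && evt_blocked_during_update(evt))
        return GATE_TRY_AGAIN;
    return GATE_OK;
}
```

With this shape, CKPT_INFO_UPDATE messages are still admitted during the window, which is what allows the CPD to rebuild its database before it accepts configuration changes again.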
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>    osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 ++++++++++++++++++++++++++
>>>    osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++
>>>    osaf/libs/common/cpsv/include/cpd_cb.h    |    4 ++
>>>    osaf/libs/common/cpsv/include/cpd_imm.h   |    2 +
>>>    osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +++
>>>    osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>>>    osaf/libs/common/cpsv/include/cpnd_cb.h   |    3 +
>>>    osaf/libs/common/cpsv/include/cpnd_init.h |    3 +
>>>    osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 ++++++++++
>>>    osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>>>    osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    osaf/services/saf/cpsv/cpd/cpd_imm.c      |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    osaf/services/saf/cpsv/cpd/cpd_init.c     |   26 ++++++++++++-
>>>    osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +++
>>>    osaf/services/saf/cpsv/cpnd/Makefile.am   |    6 ++-
>>>    osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++
>>>    osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   24 ++++++++++++
>>>    osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   34 ++++++++++++++++-
>>>    osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++
>>>    osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  429 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>    21 files changed, 1423 insertions(+), 12 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> -
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> -
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> -
>>>
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any
>>> checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank
>>>       entries that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your
>>>       headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>       (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build
>>>       tests. Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be
>>>       removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>       like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>       cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>       too much content in a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>       Instead you should place your content in a public tree to be
>>>       pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>       commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear
>>>       indication of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>       comments and change requests that were proposed in the initial
>>>       review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing the
>>>       threaded patch review.
>>>
>>> ___ Your changes affect the IPC mechanism, and you don't present any
>>>       results for the in-service upgradability test.
>>>
>>> ___ Your changes affect the user manual and documentation, and your
>>>       patch series does not contain the patch that updates the Doxygen
>>>       manual.
>>>
>>>
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>
>>>
>>
>

