Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V3 [#1621]

A V Mahesh Thu, 03 Mar 2016 02:26:25 -0800

Hi Nhat Pham,

I am working on `MDS:TIPC include node name as a part of callback_info
events [#1522]`.
I will start the review as soon as that is pushed.


-AVM


On 3/3/2016 3:41 PM, Nhat Pham wrote:
> Hi Mahesh,
>
> Have you reviewed the patch?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: A V Mahesh [mailto:[email protected]]
> Sent: Monday, February 29, 2016 1:15 PM
> To: Nhat Pham <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V3 [#1621]
>
> Hi Nhat Pham,
>
> I will review the V3 patch, do the final functional testing, and get back to
> you soon.
> (It may take some time; I also need to work on my published MDS
> enhancements.)
>
> -AVM
>
>
> On 2/29/2016 9:39 AM, Nhat Pham wrote:
>> Hi,
>>
>> The following is a summary of the updates in V3:
>>
>> Comment 1: This functionality should be guarded by a check that the Hydra
>> configuration is enabled in IMM (attrName =
>> const_cast<SaImmAttrNameT>("scAbsenceAllowed")).
>>
>> Status: Included in V3
>>
>> Comment 2: Keep the scope of the CPSV service such that non-collocated
>> checkpoint creation is NOT_SUPPORTED if the cluster is running with
>> IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration is enabled at
>> cluster startup and is currently not configurable, so there is no
>> chance of a run-time configuration change).
>>
>> Status: No change in code. CPSV still supports non-collocated
>> checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>>
>> Comment 3: This concerns the case where the checkpoint node director (cpnd)
>> crashes during headless state. In this case the cpnd can't finish
>> starting because it can't initialize the CLM service.
>> After a timeout, AMF triggers another restart. Finally, the
>> node is rebooted.
>> This problem should not lead to a node reboot.
>>
>> Status: Included in V3. CPND reinitializes the CLM service if the fault
>> code TRY_AGAIN is returned.
>>
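For illustration only, the retry behaviour described in the Status of
Comment 3 could look roughly like the sketch below. It is not the patch
code: rc_t, init_fn, init_with_retry and fake_clm_init are hypothetical
names, the result codes are stand-ins for the real SaAisErrorT values, and
the bounded attempt count is an assumption the thread does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in result codes; the real service returns SaAisErrorT values. */
typedef enum { RC_OK, RC_TRY_AGAIN, RC_FAILED } rc_t;

/* Hypothetical callback type standing in for the CLM initialization call. */
typedef rc_t (*init_fn)(void *ctx);

/* Call init repeatedly while it reports TRY_AGAIN, at most max_attempts
 * times. wait_fn may be NULL; otherwise it runs between attempts (for
 * example to sleep). The number of calls made is stored in *attempts_out. */
rc_t init_with_retry(init_fn init, void *ctx, int max_attempts,
                     void (*wait_fn)(void), int *attempts_out)
{
    assert(init != NULL && max_attempts > 0);

    rc_t rc = RC_TRY_AGAIN;
    int attempts = 0;

    while (attempts < max_attempts) {
        rc = init(ctx);
        attempts++;
        if (rc != RC_TRY_AGAIN)
            break;                      /* success or a hard failure */
        if (wait_fn != NULL && attempts < max_attempts)
            wait_fn();                  /* back off before the next try */
    }

    if (attempts_out != NULL)
        *attempts_out = attempts;
    return rc;
}

/* Fake initializer for demonstration: reports TRY_AGAIN until the
 * countdown pointed to by ctx reaches zero, then succeeds. */
rc_t fake_clm_init(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

The point of the sketch is that a TRY_AGAIN result is handled inside the
process as a retry instead of letting startup fail and leaving recovery to
a restart by AMF.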
>> Comment 4: The suggestion was to re-create the checkpoint without any
>> sections in case all replicas are lost. If the sections were
>> re-created, the application wouldn't know that data has been lost. I
>> think the BAD_HANDLE approach is okay since we have used it in other
>> services, but I see it as kind of a hack solution that is not really in line
>> with the specs.
>> The specs never intended BAD_HANDLE to be something that can happen
>> spontaneously on a previously valid handle, unless you are suffering
>> from memory corruption. In the future we could consider the
>> feasibility of avoiding spontaneous BAD_HANDLE where possible, and in
>> CKPT I think it may be possible by re-creating the checkpoints.
>>
>> Status: NOT included in V3.
>> This is a large change and requires a detailed design covering different
>> scenarios. I would suggest creating an enhancement ticket for this.
>> What do you think?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: Nhat Pham [mailto:[email protected]]
>> Sent: Monday, February 29, 2016 11:06 AM
>> To: [email protected]; [email protected]
>> Cc: [email protected]
>> Subject: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
>> preserving and recovering checkpoint replicas during headless state V3
>> [#1621]
>>
>> Summary: cpsv: Support preserving and recovering checkpoint replicas
>> during headless state V3 [#1621]
>> Review request for Trac Ticket(s): 1621
>> Peer Reviewer(s): [email protected]; [email protected]
>> Pull request to: [email protected]
>> Affected branch(es): default
>> Development branch: default
>>
>> --------------------------------
>> Impacted area       Impact y/n
>> --------------------------------
>>    Docs                    n
>>    Build system            n
>>    RPM/packaging           n
>>    Configuration files     n
>>    Startup scripts         n
>>    SAF services            y
>>    OpenSAF services        n
>>    Core libraries          n
>>    Samples                 n
>>    Tests                   n
>>    Other                   n
>>
>>
>> Comments (indicate scope for each "y" above):
>> ---------------------------------------------
>>
>> changeset 8559fe4cea27efc8234f7cf779f3c7413efcd40f
>> Author:      Nhat Pham <[email protected]>
>> Date:        Mon, 29 Feb 2016 11:02:15 +0700
>>
>>      cpsv: Support preserving and recovering checkpoint replicas during headless state V3 [#1621]
>>
>>      Background:
>>      ----------
>>      This enhancement preserves checkpoint replicas when both SCs are
>>      down (headless state) and recovers the replicas when one of the SCs
>>      comes up again. If both SCs go down, checkpoint replicas on surviving
>>      nodes remain. When an SC is available again, surviving replicas are
>>      automatically registered in the SC checkpoint database. The content of
>>      surviving replicas is kept intact and synchronized to new replicas.
>>
>>      When no SC is available, client API calls that change checkpoint
>>      configuration, which requires SC communication, are rejected. Client
>>      API calls that read and write existing checkpoint replicas still work.
>>
>>      Limitation: The CKPT service does not support recovering checkpoints
>>      in the following cases:
>>       - The checkpoint was unlinked before headless state.
>>       - The non-collocated checkpoint has its active replica located on an SC.
>>       - The non-collocated checkpoint has its active replica located on a PL
>>         and this PL restarts during headless state.
>>      In these cases, the checkpoint replica is destroyed. The fault code
>>      SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>>      checkpoint. The client must re-open the checkpoint.
>>
>>      While in headless state, accessing checkpoint replicas does not work
>>      if the node that hosts the active replica goes down. It will work
>>      again when an SC is available again.
>>
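As a rough illustration of the client-side consequence of the limitation
above (re-open the checkpoint after SA_AIS_ERR_BAD_HANDLE), here is a small
self-contained simulation. Nothing in it is the CKPT API: ck_client_t,
ck_write, ck_reopen and ck_write_recovering are hypothetical names, and the
"replica lost" flag merely simulates a replica destroyed during headless
state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in result codes; a real client would see SaAisErrorT values. */
typedef enum { CK_OK, CK_TRY_AGAIN, CK_BAD_HANDLE } ck_rc_t;

/* Hypothetical client-side view of one open checkpoint. */
typedef struct {
    int  handle;        /* stand-in for the checkpoint handle */
    bool replica_lost;  /* simulates a replica destroyed while headless */
    int  reopen_count;  /* how many times the client had to re-open */
} ck_client_t;

/* Simulated write: fails with BAD_HANDLE once the replica is gone. */
ck_rc_t ck_write(ck_client_t *c)
{
    return c->replica_lost ? CK_BAD_HANDLE : CK_OK;
}

/* Simulated re-open: the client gets a fresh handle on a new checkpoint. */
void ck_reopen(ck_client_t *c)
{
    c->handle++;
    c->replica_lost = false;
    c->reopen_count++;
}

/* Write once; on BAD_HANDLE, re-open and retry a single time.
 * *data_lost is set so the caller knows earlier contents are gone
 * and must be rebuilt by the application. */
ck_rc_t ck_write_recovering(ck_client_t *c, bool *data_lost)
{
    assert(c != NULL && data_lost != NULL);

    *data_lost = false;
    ck_rc_t rc = ck_write(c);
    if (rc == CK_BAD_HANDLE) {
        ck_reopen(c);
        *data_lost = true;
        rc = ck_write(c);
    }
    return rc;
}
```

Reporting the loss explicitly to the caller reflects the concern raised in
Comment 4: the application has to learn that the earlier data is gone.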
>>      Solution:
>>      ---------
>>      The solution for this enhancement includes 2 parts:
>>
>>      1. Destroy the un-recoverable checkpoints described above when both
>>      SCs are down: When both SCs are down, the CPND deletes un-recoverable
>>      checkpoint nodes and replicas on PLs. Then it requests the CPA to
>>      destroy the corresponding checkpoint node by using the new message
>>      CPA_EVT_ND2A_CKPT_DESTROY.
>>
>>      2. Update the CPD with checkpoint information: When an active SC is up
>>      after headless state, the CPND updates the CPD with checkpoint
>>      information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE
>>      instead of CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>>      CPD_EVT_ND2D_CKPT_CREATE were used, the CPND would create a new
>>      ckpt_id for the checkpoint, which might differ from the current
>>      ckpt_id. The CPD collects checkpoint information within 6s. During
>>      this update window, the following requests are rejected with
>>      fault code SA_AIS_ERR_TRY_AGAIN:
>>      - CPD_EVT_ND2D_CKPT_CREATE
>>      - CPD_EVT_ND2D_CKPT_UNLINK
>>      - CPD_EVT_ND2D_ACTIVE_SET
>>      - CPD_EVT_ND2D_CKPT_RDSET
>>
>>
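The rejection rule in part 2 can be pictured with the minimal sketch below.
It assumes a simple timestamp comparison, whereas the diffstat suggests the
patch uses a CPD timer (cpd_tmr.c); the enum values, director_t,
begin_update and admit are hypothetical names chosen for illustration, and
only the 6s figure and the four rejected request types come from the commit
message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event tags mirroring the message names in the commit text. */
typedef enum {
    EVT_CKPT_CREATE,
    EVT_CKPT_UNLINK,
    EVT_ACTIVE_SET,
    EVT_CKPT_RDSET,
    EVT_CKPT_INFO_UPDATE
} evt_t;

typedef enum { RES_ACCEPT, RES_TRY_AGAIN } res_t;

#define UPDATE_WINDOW_MS 6000L  /* the 6s collection period */

/* Hypothetical director state: whether the update window is open. */
typedef struct {
    bool updating;
    long window_start_ms;
} director_t;

/* Open the update window, e.g. when an active SC comes up after headless. */
void begin_update(director_t *d, long now_ms)
{
    d->updating = true;
    d->window_start_ms = now_ms;
}

/* Decide whether a request is processed now or answered with TRY_AGAIN. */
res_t admit(director_t *d, evt_t evt, long now_ms)
{
    /* Close the window once the collection period has elapsed. */
    if (d->updating && now_ms - d->window_start_ms >= UPDATE_WINDOW_MS)
        d->updating = false;

    if (!d->updating)
        return RES_ACCEPT;

    switch (evt) {
    case EVT_CKPT_CREATE:
    case EVT_CKPT_UNLINK:
    case EVT_ACTIVE_SET:
    case EVT_CKPT_RDSET:
        return RES_TRY_AGAIN;   /* rejected while information is collected */
    default:
        return RES_ACCEPT;      /* e.g. the info-update message itself */
    }
}
```

A caller that receives the TRY_AGAIN result is expected to retry later, by
which time the window has closed and the request is accepted.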
>> Complete diffstat:
>> ------------------
>>    osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 ++++++++++++++++++++++++++
>>    osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++
>>    osaf/libs/common/cpsv/include/cpd_cb.h    |    4 ++
>>    osaf/libs/common/cpsv/include/cpd_imm.h   |    2 +
>>    osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +++
>>    osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>>    osaf/libs/common/cpsv/include/cpnd_cb.h   |    3 +
>>    osaf/libs/common/cpsv/include/cpnd_init.h |    3 +
>>    osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 ++++++++++
>>    osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>>    osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>    osaf/services/saf/cpsv/cpd/cpd_imm.c      |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>    osaf/services/saf/cpsv/cpd/cpd_init.c     |   26 ++++++++++++-
>>    osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>    osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +++
>>    osaf/services/saf/cpsv/cpnd/Makefile.am   |    6 ++-
>>    osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++
>>    osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   24 ++++++++++++
>>    osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   34 ++++++++++++++++-
>>    osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++
>>    osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  429 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>    21 files changed, 1423 insertions(+), 12 deletions(-)
>>
>>
>> Testing Commands:
>> -----------------
>> -
>>
>> Testing, Expected Results:
>> --------------------------
>> -
>>
>>
>> Conditions of Submission:
>> -------------------------
>> -
>>
>>
>> Arch      Built     Started    Linux distro
>> -------------------------------------------
>> mips        n          n
>> mips64      n          n
>> x86         n          n
>> x86_64      y          y
>> powerpc     n          n
>> powerpc64   n          n
>>
>>
>> Reviewer Checklist:
>> -------------------
>> [Submitters: make sure that your review doesn't trigger any
>> checkmarks!]
>>
>>
>> Your checkin has not passed review because (see checked entries):
>>
>> ___ Your RR template is generally incomplete; it has too many blank entries
>>       that need proper data filled in.
>>
>> ___ You have failed to nominate the proper persons for review and push.
>>
>> ___ Your patches do not have proper short+long header
>>
>> ___ You have grammar/spelling in your header that is unacceptable.
>>
>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>
>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>
>> ___ You have incorrectly put/left internal data in your comments/files
>>       (i.e. internal bug tracking tool IDs, product names etc)
>>
>> ___ You have not given any evidence of testing beyond basic build tests.
>>       Demonstrate some level of runtime or other sanity testing.
>>
>> ___ You have ^M present in some of your files. These have to be removed.
>>
>> ___ You have needlessly changed whitespace or added whitespace crimes
>>       like trailing spaces, or spaces before tabs.
>>
>> ___ You have mixed real technical changes with whitespace and other
>>       cosmetic code cleanup changes. These have to be separate commits.
>>
>> ___ You need to refactor your submission into logical chunks; there is
>>       too much content into a single commit.
>>
>> ___ You have extraneous garbage in your review (merge commits etc)
>>
>> ___ You have giant attachments which should never have been sent;
>>       Instead you should place your content in a public tree to be pulled.
>>
>> ___ You have too many commits attached to an e-mail; resend as threaded
>>       commits, or place in a public tree for a pull.
>>
>> ___ You have resent this content multiple times without a clear indication
>>       of what has changed between each re-send.
>>
>> ___ You have failed to adequately and individually address all of the
>>       comments and change requests that were proposed in the initial review.
>>
>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>
>> ___ Your computer has a badly configured date and time, confusing
>>       the threaded patch review.
>>
>> ___ Your changes affect IPC mechanism, and you don't present any results
>>       for in-service upgradability test.
>>
>> ___ Your changes affect user manual and documentation, your patch series
>>       do not contain the patch that updates the Doxygen manual.
>>
>>
>> ------------------------------------------------------------------------------
>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>> Monitor end-to-end web transactions and take corrective actions now
>> Troubleshoot faster and improve end-user experience. Signup Now!
>> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
>> _______________________________________________
>> Opensaf-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>
>>
>

