Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support preserving and recovering checkpoint replicas during headless state V3 [#1621]

Nhat Pham Thu, 03 Mar 2016 02:12:50 -0800

Hi Mahesh,

Have you reviewed the patch?


Best regards,
Nhat Pham

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Monday, February 29, 2016 1:15 PM
To: Nhat Pham <[email protected]>; [email protected]
Cc: [email protected]
Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support 
preserving and recovering checkpoint replicas during headless state V3 [#1621]

Hi Nhat Pham,

I will review the V3 patch, do the final functional testing, and get back to 
you soon.
(It may take some time; I also need to work on my published MDS 
enhancements.)

-AVM


On 2/29/2016 9:39 AM, Nhat Pham wrote:
> Hi,
>
> Following is a summary of the updates in V3:
>
> Comment 1: This functionality should be enabled only if the Hydra
> configuration is enabled in IMM, checked via attrName =
> const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>
> Status: Included in V3
>
> Comment 2: To keep the scope of the CPSV service, non-collocated
> checkpoint creation should return NOT_SUPPORTED if the cluster is running
> with IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration is enabled
> at cluster startup and is currently not configurable afterwards, so there
> is no chance of a run-time configuration change).
>
> Status: No change in code. CPSV still supports non-collocated
> checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>
> Comment 3: This is about the case where the checkpoint node director
> (cpnd) crashes during headless state. In this case the cpnd can't finish
> starting because it can't initialize the CLM service.
> Then, after the timeout, AMF triggers a restart again. Finally, the
> node is rebooted.
> It is expected that this problem should not lead to a node reboot.
>
> Status: Included in V3. CPND reinitializes the CLM service if the fault
> code TRY_AGAIN is returned.
>
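
The retry behaviour described under Comment 3 can be sketched roughly as below. This is only an illustration, not the code in the patch: the result codes, the simulated CLM call, and the names sketch_clm_initialize / sketch_clm_init_with_retry are hypothetical stand-ins, and the retry limit is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; real code would use SaAisErrorT from the SAF headers. */
typedef enum { SKETCH_OK, SKETCH_TRY_AGAIN, SKETCH_FAILED } sketch_rc_t;

/* Stand-in for the CLM initialize call: reports TRY_AGAIN while the
 * simulated headless period (counted in calls) has not elapsed. */
static sketch_rc_t sketch_clm_initialize(int *headless_calls_left)
{
    if (*headless_calls_left > 0) {
        (*headless_calls_left)--;
        return SKETCH_TRY_AGAIN;
    }
    return SKETCH_OK;
}

/* Retry initialization while TRY_AGAIN is returned, at most max_attempts
 * times. Stores the number of attempts made and returns the last result. */
static sketch_rc_t sketch_clm_init_with_retry(int *headless_calls_left,
                                              int max_attempts,
                                              int *attempts_made)
{
    sketch_rc_t rc = SKETCH_FAILED;
    int attempts = 0;

    assert(headless_calls_left != NULL && attempts_made != NULL);

    while (attempts < max_attempts) {
        attempts++;
        rc = sketch_clm_initialize(headless_calls_left);
        if (rc != SKETCH_TRY_AGAIN)
            break; /* success or a hard failure: stop retrying */
        /* A real daemon would sleep or re-arm a timer here. */
    }
    *attempts_made = attempts;
    return rc;
}
```

The point of the sketch is that TRY_AGAIN is treated as "not ready yet" rather than as a start-up failure, so the process keeps running instead of timing out and being restarted by AMF.
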
> Comment 4: The suggestion was to re-create the checkpoint without any
> sections in case all replicas are lost. If the sections were
> re-created, the application wouldn't know that data has been lost. I
> think the BAD_HANDLE approach is okay since we have used it in other
> services, but I see it as kind of a hack solution that is not really in
> line with the specs.
> The specs never intended BAD_HANDLE to be something that can happen
> spontaneously on a previously valid handle, unless you are suffering
> from memory corruption. In the future we could consider the
> feasibility of avoiding spontaneous BAD_HANDLE where possible, and in
> CKPT I think it may be possible by re-creating the checkpoints.
>
> Status: NOT included in V3.
> This change is quite large and requires a detailed design covering
> different scenarios. I would suggest creating an enhancement ticket for
> this. What do you think?
>
> Best regards,
> Nhat Pham
>
> -----Original Message-----
> From: Nhat Pham [mailto:[email protected]]
> Sent: Monday, February 29, 2016 11:06 AM
> To: [email protected]; [email protected]
> Cc: [email protected]
> Subject: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
> preserving and recovering checkpoint replicas during headless state V3
> [#1621]
>
> Summary: cpsv: Support preserving and recovering checkpoint replicas
> during headless state V3 [#1621]
> Review request for Trac Ticket(s): 1621
> Peer Reviewer(s): [email protected]; [email protected]
> Pull request to: [email protected]
> Affected branch(es): default
> Development branch: default
>
> --------------------------------
> Impacted area       Impact y/n
> --------------------------------
>   Docs                    n
>   Build system            n
>   RPM/packaging           n
>   Configuration files     n
>   Startup scripts         n
>   SAF services            y
>   OpenSAF services        n
>   Core libraries          n
>   Samples                 n
>   Tests                   n
>   Other                   n
>
>
> Comments (indicate scope for each "y" above):
> ---------------------------------------------
>
> changeset 8559fe4cea27efc8234f7cf779f3c7413efcd40f
> Author:       Nhat Pham <[email protected]>
> Date: Mon, 29 Feb 2016 11:02:15 +0700
>
>       cpsv: Support preserving and recovering checkpoint replicas during
>       headless state V3 [#1621]
>
>       Background:
>       ----------
>       This enhancement preserves checkpoint replicas when both SCs are
>       down (headless state) and recovers the replicas when one of the SCs
>       comes up again. If both SCs go down, checkpoint replicas on surviving
>       nodes remain. When an SC is available again, surviving replicas are
>       automatically registered in the SC checkpoint database. The content of
>       surviving replicas is intact and is synchronized to new replicas.
>
>       When no SC is available, client API calls that change the checkpoint
>       configuration, and therefore require SC communication, are rejected.
>       Client API calls that read and write existing checkpoint replicas
>       still work.
>
>       Limitation: The CKPT service does not support recovering checkpoints
>       in the following cases:
>        - The checkpoint was unlinked before the headless state.
>        - The non-collocated checkpoint has its active replica located on
>          an SC.
>        - The non-collocated checkpoint has its active replica located on
>          a PL and this PL restarts during the headless state.
>       In these cases, the checkpoint replica is destroyed. The fault code
>       SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>       checkpoint. The client must re-open the checkpoint.
>
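
The limitation cases listed above can be read as a simple predicate. The sketch below is an illustration under assumptions: struct sketch_ckpt and its field names are hypothetical and do not come from the patch, and any case not named in the list is assumed to be recoverable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one checkpoint at the time the cluster goes
 * headless; the field names are illustrative only. */
struct sketch_ckpt {
    bool unlinked_before_headless;     /* unlinked before both SCs went down */
    bool collocated;                   /* created as a collocated checkpoint */
    bool active_replica_on_sc;         /* active replica was hosted on an SC */
    bool active_pl_restarted_headless; /* hosting PL restarted while headless */
};

/* Returns true if the checkpoint can be recovered once an SC is back,
 * following the three limitation cases in the commit message. */
static bool sketch_ckpt_is_recoverable(const struct sketch_ckpt *c)
{
    assert(c != NULL);

    if (c->unlinked_before_headless)
        return false;
    if (!c->collocated) {
        if (c->active_replica_on_sc)
            return false; /* active replica went down with the SCs */
        if (c->active_pl_restarted_headless)
            return false; /* active replica lost with the PL restart */
    }
    return true; /* assumption: everything else survives */
}
```

For a checkpoint where this kind of check fails, the commit message says the replica is destroyed and the client sees SA_AIS_ERR_BAD_HANDLE until it re-opens the checkpoint.
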
>       While in headless state, accessing checkpoint replicas does not work
>       if the node which hosts the active replica goes down. It will work
>       again when an SC is available again.
>
>       Solution:
>       ---------
>       The solution for this enhancement includes 2 parts:
>
>       1. Destroy the un-recoverable checkpoints described above when both
>       SCs are down: When both SCs are down, the CPND deletes un-recoverable
>       checkpoint nodes and replicas on PLs. Then it requests the CPA to
>       destroy the corresponding checkpoint node by using the new message
>       CPA_EVT_ND2A_CKPT_DESTROY.
>
>       2. Update the CPD with checkpoint information: When an active SC is
>       up after the headless state, the CPND updates the CPD with checkpoint
>       information by using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE
>       instead of CPD_EVT_ND2D_CKPT_CREATE. This is because, if
>       CPD_EVT_ND2D_CKPT_CREATE were used, the CPND would create a new
>       ckpt_id for the checkpoint, which might differ from the current
>       ckpt id. The CPD collects checkpoint information within 6s. During
>       this update window, the following requests are rejected with fault
>       code SA_AIS_ERR_TRY_AGAIN:
>       - CPD_EVT_ND2D_CKPT_CREATE
>       - CPD_EVT_ND2D_CKPT_UNLINK
>       - CPD_EVT_ND2D_ACTIVE_SET
>       - CPD_EVT_ND2D_CKPT_RDSET
>
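
The rejection rule during the 6s update window could look roughly like the sketch below. It is a simplification under assumptions: the enum values only mirror the message names from the commit message, struct sketch_cpd and the sketch_* functions are hypothetical, and a timestamp comparison is used here in place of whatever timer the patch actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event identifiers mirroring the message names above. */
typedef enum {
    SKETCH_EVT_CKPT_CREATE,
    SKETCH_EVT_CKPT_UNLINK,
    SKETCH_EVT_ACTIVE_SET,
    SKETCH_EVT_CKPT_RDSET,
    SKETCH_EVT_CKPT_INFO_UPDATE,
    SKETCH_EVT_OTHER
} sketch_evt_t;

typedef enum { SKETCH_PROCEED, SKETCH_TRY_AGAIN } sketch_rc_t;

#define SKETCH_UPDATE_WINDOW_MS 6000L /* the 6s collection period */

/* Hypothetical director state: whether the window is open and when it began. */
struct sketch_cpd {
    bool collecting;
    long window_start_ms;
};

/* Open the collection window when an SC becomes active after headless. */
static void sketch_start_update(struct sketch_cpd *cpd, long now_ms)
{
    assert(cpd != NULL);
    cpd->collecting = true;
    cpd->window_start_ms = now_ms;
}

/* Decide whether an incoming request is processed or answered TRY_AGAIN. */
static sketch_rc_t sketch_admit(struct sketch_cpd *cpd, sketch_evt_t evt,
                                long now_ms)
{
    assert(cpd != NULL);

    /* Close the window once the collection period has elapsed. */
    if (cpd->collecting &&
        now_ms - cpd->window_start_ms >= SKETCH_UPDATE_WINDOW_MS)
        cpd->collecting = false;

    if (!cpd->collecting)
        return SKETCH_PROCEED;

    switch (evt) {
    case SKETCH_EVT_CKPT_CREATE:
    case SKETCH_EVT_CKPT_UNLINK:
    case SKETCH_EVT_ACTIVE_SET:
    case SKETCH_EVT_CKPT_RDSET:
        return SKETCH_TRY_AGAIN; /* database is still being rebuilt */
    default:
        return SKETCH_PROCEED;   /* e.g. the info update itself */
    }
}
```

Returning TRY_AGAIN rather than a hard error lets clients simply retry after the window closes, by which time the surviving replicas should be registered again.
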
>
> Complete diffstat:
> ------------------
>   osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 ++++++++++++++++++++++++++
>   osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++
>   osaf/libs/common/cpsv/include/cpd_cb.h    |    4 ++
>   osaf/libs/common/cpsv/include/cpd_imm.h   |    2 +
>   osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +++
>   osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>   osaf/libs/common/cpsv/include/cpnd_cb.h   |    3 +
>   osaf/libs/common/cpsv/include/cpnd_init.h |    3 +
>   osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 ++++++++++
>   osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>   osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/cpsv/cpd/cpd_imm.c      |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/cpsv/cpd/cpd_init.c     |   26 ++++++++++++-
>   osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +++
>   osaf/services/saf/cpsv/cpnd/Makefile.am   |    6 ++-
>   osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++
>   osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   24 ++++++++++++
>   osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   34 ++++++++++++++++-
>   osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++
>   osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  429 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>   21 files changed, 1423 insertions(+), 12 deletions(-)
>
>
> Testing Commands:
> -----------------
> -
>
> Testing, Expected Results:
> --------------------------
> -
>
>
> Conditions of Submission:
> -------------------------
> -
>
>
> Arch      Built     Started    Linux distro
> -------------------------------------------
> mips        n          n
> mips64      n          n
> x86         n          n
> x86_64      y          y
> powerpc     n          n
> powerpc64   n          n
>
>
> Reviewer Checklist:
> -------------------
> [Submitters: make sure that your review doesn't trigger any
> checkmarks!]
>
>
> Your checkin has not passed review because (see checked entries):
>
> ___ Your RR template is generally incomplete; it has too many blank entries
>      that need proper data filled in.
>
> ___ You have failed to nominate the proper persons for review and push.
>
> ___ Your patches do not have proper short+long header
>
> ___ You have grammar/spelling in your header that is unacceptable.
>
> ___ You have exceeded a sensible line length in your headers/comments/text.
>
> ___ You have failed to put in a proper Trac Ticket # into your commits.
>
> ___ You have incorrectly put/left internal data in your comments/files
>      (i.e. internal bug tracking tool IDs, product names etc)
>
> ___ You have not given any evidence of testing beyond basic build tests.
>      Demonstrate some level of runtime or other sanity testing.
>
> ___ You have ^M present in some of your files. These have to be removed.
>
> ___ You have needlessly changed whitespace or added whitespace crimes
>      like trailing spaces, or spaces before tabs.
>
> ___ You have mixed real technical changes with whitespace and other
>      cosmetic code cleanup changes. These have to be separate commits.
>
> ___ You need to refactor your submission into logical chunks; there is
>      too much content into a single commit.
>
> ___ You have extraneous garbage in your review (merge commits etc)
>
> ___ You have giant attachments which should never have been sent;
>      Instead you should place your content in a public tree to be pulled.
>
> ___ You have too many commits attached to an e-mail; resend as threaded
>      commits, or place in a public tree for a pull.
>
> ___ You have resent this content multiple times without a clear indication
>      of what has changed between each re-send.
>
> ___ You have failed to adequately and individually address all of the
>      comments and change requests that were proposed in the initial review.
>
> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>
> ___ Your computer has a badly configured date and time, confusing
>      the threaded patch review.
>
> ___ Your changes affect IPC mechanism, and you don't present any results
>      for in-service upgradability test.
>
> ___ Your changes affect user manual and documentation, your patch series
>      do not contain the patch that updates the Doxygen manual.
>
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=272487151&iu=/4140
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>
>


