Hi Nhat Pham,
ACK from me.
Tested and reviewed the following:
-----------------------------------------
1) Application running on payloads and controllers WITHOUT headless
state occurrences
2) Application running on payloads and controllers WITH headless state
occurrences
3) Application running on payloads and controllers with CPND restart and WITH
headless state occurrences
4) In-service upgrade while the application is running on a payload WITH
headless state occurrences
(Active controller old, without the #1621 patch; Standby controller
new, with the #1621 patch; 2 payloads old, without the #1621 patch)
5) In-service upgrade while the application is running on Standby/Active
new/old controllers WITH headless state occurrences
(Active controller old, without the #1621 patch; Standby controller
new, with the #1621 patch; 2 payloads old, without the #1621 patch)
No major issues observed, apart from some occurrences of
`saImmOiRtObjectCreate_2 failed with error = 14`:
=================================================================================
Mar 10 15:22:57 SC-1 osafamfd[15219]: NO Node 'PL-3' joined the cluster
Mar 10 15:23:12 SC-1 osafckptd[15382]: ER saImmOiRtObjectCreate_2 failed with error = 14
Mar 10 15:23:12 SC-1 osafckptd[15382]: ER create runtime ckpt object failed with error: 14
=================================================================================
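For reference when reading the log above: in the SA Forum AIS error enumeration (SaAisErrorT in saAis.h), the value 14 is SA_AIS_ERR_EXIST, which suggests the runtime object was already present in IMM when osafckptd tried to create it. The following is a minimal, self-contained sketch for decoding such log lines. The helper names ais_err_name and ais_err_from_log are hypothetical, and only a subset of the enumeration is reproduced.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Subset of SaAisErrorT values (saAis.h), copied here so the sketch
 * compiles without the OpenSAF headers. */
enum {
    AIS_OK = 1,
    AIS_ERR_TIMEOUT = 5,
    AIS_ERR_TRY_AGAIN = 6,
    AIS_ERR_BAD_HANDLE = 9,
    AIS_ERR_NOT_EXIST = 12,
    AIS_ERR_EXIST = 14
};

/* Hypothetical helper: symbolic name for a numeric AIS code. */
const char *ais_err_name(int code)
{
    switch (code) {
    case AIS_OK:             return "SA_AIS_OK";
    case AIS_ERR_TIMEOUT:    return "SA_AIS_ERR_TIMEOUT";
    case AIS_ERR_TRY_AGAIN:  return "SA_AIS_ERR_TRY_AGAIN";
    case AIS_ERR_BAD_HANDLE: return "SA_AIS_ERR_BAD_HANDLE";
    case AIS_ERR_NOT_EXIST:  return "SA_AIS_ERR_NOT_EXIST";
    case AIS_ERR_EXIST:      return "SA_AIS_ERR_EXIST";
    default:                 return "UNKNOWN";
    }
}

/* Hypothetical helper: pull the number that follows "error = " or
 * "error: " out of a syslog line. Returns -1 if no code is found. */
int ais_err_from_log(const char *line)
{
    assert(line != NULL);
    const char *p = strstr(line, "error");
    if (p == NULL)
        return -1;
    p += strlen("error");
    /* Skip the separator between the word and the number. */
    while (*p == ' ' || *p == '=' || *p == ':')
        p++;
    char *end;
    long value = strtol(p, &end, 10);
    return (end == p) ? -1 : (int)value;
}
```

Feeding either of the two ER lines above to ais_err_from_log and passing the result to ais_err_name gives SA_AIS_ERR_EXIST.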
Not yet tested:
-----------------------------------------
1) Ckpt data content restoration after headless state occurrences
Note: I observed an issue with dataSize in the ioVector and will continue
this testing. If any issue is found in CKPT data restore, we can fix it.
=================================================================================
520|0 304 13304835 1 68| Verifying IO Vector returned in the CheckpointTrackCallback
520|0 304 13304835 1 69| Size of the data written 13 and the dataSize in the ioVector 46 doesnot match
=================================================================================
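The failing step compares the number of bytes written with the dataSize reported in the ioVector. A minimal sketch of that kind of comparison is shown here. It assumes a simplified stand-in struct (iovec_elem, hypothetical) instead of the real ioVector element type, and the function name first_size_mismatch is also hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one ioVector element; only the
 * size field that the check needs is modelled. */
struct iovec_elem {
    size_t data_size; /* bytes the service reports for this element */
};

/* Return the index of the first element whose reported size differs
 * from the number of bytes written, or -1 if every element matches. */
int first_size_mismatch(const struct iovec_elem *vec,
                        const size_t *written, size_t n)
{
    assert(vec != NULL && written != NULL);
    for (size_t i = 0; i < n; i++) {
        if (vec[i].data_size != written[i])
            return (int)i;
    }
    return -1;
}
```

With 13 bytes written and a reported dataSize of 46, as in the log above, the element would be flagged as a mismatch.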
General note:
---------------------
Please add a README in the CPSV folder.
-AVM
>
>
> On 3/3/2016 3:41 PM, Nhat Pham wrote:
>> Hi Mahesh,
>>
>> Have you reviewed the patch?
>>
>> Best regards,
>> Nhat Pham
>>
>> -----Original Message-----
>> From: A V Mahesh [mailto:[email protected]]
>> Sent: Monday, February 29, 2016 1:15 PM
>> To: Nhat Pham <[email protected]>; [email protected]
>> Cc: [email protected]
>> Subject: Re: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
>> preserving and recovering checkpoint replicas during headless state
>> V3 [#1621]
>>
>> Hi Nhat Pham,
>>
>> I will review the V3 patch, do the final functional testing, and get
>> back to you soon.
>> (It may take some time; I also need to work on my published MDS
>> enhancements.)
>>
>> -AVM
>>
>>
>> On 2/29/2016 9:39 AM, Nhat Pham wrote:
>>> Hi,
>>>
>>> The following is a summary of the updates in V3:
>>>
>>> Comment 1: This functionality should be enabled only after checking
>>> that the Hydra configuration is enabled in IMM, attrName =
>>> const_cast<SaImmAttrNameT>("scAbsenceAllowed").
>>>
>>> Status: Included in V3
>>>
>>> Comment 2: Limit the scope of the CPSV service by returning
>>> NOT_SUPPORTED for non-collocated checkpoint creation if the cluster is
>>> running with IMMSV_SC_ABSENCE_ALLOWED (the headless state configuration
>>> is enabled at cluster startup and is currently not configurable, so
>>> there is no chance of a run-time configuration change).
>>>
>>> Status: No change in code. The CPSV still supports non-collocated
>>> checkpoints even if IMMSV_SC_ABSENCE_ALLOWED is enabled.
>>>
>>> Comment 3: This is about the case where the checkpoint node director
>>> (cpnd) crashes during headless state. In this case the cpnd can't
>>> finish starting because it can't initialize the CLM service.
>>> Then, after the timeout, the AMF triggers a restart again. Finally, the
>>> node is rebooted.
>>> It is expected that this problem should not lead to a node reboot.
>>>
>>> Status: Included in V3. CPND reinitializes the CLM service if the fault
>>> code TRY_AGAIN is returned.
>>>
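The fix described for Comment 3 amounts to retrying initialization while the service answers TRY_AGAIN, instead of giving up and letting AMF escalate. Below is a minimal sketch of such a loop, not the code from the patch. The names init_fn, init_with_retry and flaky_init are hypothetical, the attempt limit is illustrative, and the back-off is left to a caller-supplied callback so the sketch stays portable. In CPND the init callback would presumably wrap the CLM initialize call.

```c
#include <assert.h>
#include <stddef.h>

/* Subset of SaAisErrorT values (saAis.h), copied for self-containment. */
enum { AIS_OK = 1, AIS_ERR_TRY_AGAIN = 6 };

/* Hypothetical callback types: one initialization attempt and an
 * optional pause between attempts. */
typedef int (*init_fn)(void *ctx);
typedef void (*pause_fn)(void);

/* Call init until it returns something other than TRY_AGAIN or the
 * attempt budget is used up. Returns the last return code seen. */
int init_with_retry(init_fn init, void *ctx, int max_attempts,
                    pause_fn pause)
{
    assert(init != NULL && max_attempts > 0);
    int rc = AIS_ERR_TRY_AGAIN;
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        rc = init(ctx);
        if (rc != AIS_ERR_TRY_AGAIN)
            break;
        if (pause != NULL)
            pause(); /* back off before the next attempt */
    }
    return rc;
}

/* Test double standing in for a service that is not ready yet: answers
 * TRY_AGAIN until the int counter passed via ctx reaches zero. */
int flaky_init(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return AIS_ERR_TRY_AGAIN;
    }
    return AIS_OK;
}
```

A bounded attempt count keeps the loop from spinning forever if no SC ever returns; what to do once the budget is exhausted is a separate design decision.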
>>> Comment 4: The suggestion was to re-create the checkpoint without any
>>> sections in case all replicas are lost. If the sections were
>>> re-created, the application wouldn't know that data has been lost. I
>>> think the BAD_HANDLE approach is okay since we have used it in other
>>> services, but I see it as kind of a hack solution that is not really
>>> in line with the specs.
>>> The specs never intended BAD_HANDLE to be something that can happen
>>> spontaneously on a previously valid handle, unless you are suffering
>>> from memory corruption. In the future we could consider the
>>> feasibility of avoiding spontaneous BAD_HANDLE where possible, and in
>>> CKPT I think it may be possible by re-creating the checkpoints.
>>>
>>> Status: NOT included in V3.
>>> This change is substantial and requires a detailed design covering
>>> different scenarios. I would suggest creating an enhancement ticket
>>> for this. What do you think?
>>>
>>> Best regards,
>>> Nhat Pham
>>>
>>> -----Original Message-----
>>> From: Nhat Pham [mailto:[email protected]]
>>> Sent: Monday, February 29, 2016 11:06 AM
>>> To: [email protected]; [email protected]
>>> Cc: [email protected]
>>> Subject: [devel] [PATCH 0 of 1] Review Request for cpsv: Support
>>> preserving and recovering checkpoint replicas during headless state V3
>>> [#1621]
>>>
>>> Summary: cpsv: Support preserving and recovering checkpoint replicas
>>> during headless state V3 [#1621]
>>> Review request for Trac Ticket(s): 1621
>>> Peer Reviewer(s): [email protected]; [email protected]
>>> Pull request to: [email protected]
>>> Affected branch(es): default
>>> Development branch: default
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>> Docs                    n
>>> Build system            n
>>> RPM/packaging           n
>>> Configuration files     n
>>> Startup scripts         n
>>> SAF services            y
>>> OpenSAF services        n
>>> Core libraries          n
>>> Samples                 n
>>> Tests                   n
>>> Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>
>>> changeset 8559fe4cea27efc8234f7cf779f3c7413efcd40f
>>> Author: Nhat Pham <[email protected]>
>>> Date: Mon, 29 Feb 2016 11:02:15 +0700
>>>
>>> cpsv: Support preserving and recovering checkpoint replicas during
>>> headless state V3 [#1621]
>>>
>>> Background:
>>> ----------
>>> This enhancement preserves checkpoint replicas when both SCs are down
>>> (headless state) and recovers the replicas when one of the SCs comes up
>>> again. If both SCs go down, checkpoint replicas on surviving nodes
>>> remain. When an SC is available again, surviving replicas are
>>> automatically registered in the SC checkpoint database. Content in
>>> surviving replicas is kept intact and synchronized to new replicas.
>>>
>>> When no SC is available, client API calls that change checkpoint
>>> configuration, and therefore require SC communication, are rejected.
>>> Client API calls reading and writing existing checkpoint replicas still
>>> work.
>>>
>>> Limitation: The CKPT service does not support recovering checkpoints in
>>> the following cases:
>>> - The checkpoint was unlinked before the headless state.
>>> - The non-collocated checkpoint has its active replica located on an SC.
>>> - The non-collocated checkpoint has its active replica located on a PL
>>> and this PL restarts during the headless state.
>>> In these cases, the checkpoint replica is destroyed. The fault code
>>> SA_AIS_ERR_BAD_HANDLE is returned when the client accesses the
>>> checkpoint, and the client must re-open the checkpoint.
>>>
>>> While in headless state, accessing checkpoint replicas does not work if
>>> the node which hosts the active replica goes down. Access works again
>>> when an SC is available again.
>>>
>>> Solution:
>>> ---------
>>> The solution for this enhancement includes 2 parts:
>>>
>>> 1. Destroy the un-recoverable checkpoints described above when both SCs
>>> are down: when both SCs are down, the CPND deletes un-recoverable
>>> checkpoint nodes and replicas on PLs. Then it requests the CPA to destroy
>>> the corresponding checkpoint node by using the new message
>>> CPA_EVT_ND2A_CKPT_DESTROY.
>>>
>>> 2. Update the CPD with checkpoint information: when an active SC is up
>>> after headless, the CPND updates the CPD with checkpoint information by
>>> using the new message CPD_EVT_ND2D_CKPT_INFO_UPDATE instead of
>>> CPD_EVT_ND2D_CKPT_CREATE. This is because, if CPD_EVT_ND2D_CKPT_CREATE
>>> were used, the CPND would create a new ckpt_id for the checkpoint, which
>>> might differ from the current ckpt_id. The CPD collects checkpoint
>>> information within 6s. During this update window, the following requests
>>> are rejected with fault code SA_AIS_ERR_TRY_AGAIN:
>>> - CPD_EVT_ND2D_CKPT_CREATE
>>> - CPD_EVT_ND2D_CKPT_UNLINK
>>> - CPD_EVT_ND2D_ACTIVE_SET
>>> - CPD_EVT_ND2D_CKPT_RDSET
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>> osaf/libs/agents/saf/cpa/cpa_proc.c       |   52 ++++++++++++++++++++++++++
>>> osaf/libs/common/cpsv/cpsv_edu.c          |   43 +++++++++++++++++++++
>>> osaf/libs/common/cpsv/include/cpd_cb.h    |    4 ++
>>> osaf/libs/common/cpsv/include/cpd_imm.h   |    2 +
>>> osaf/libs/common/cpsv/include/cpd_proc.h  |    7 +++
>>> osaf/libs/common/cpsv/include/cpd_tmr.h   |    3 +-
>>> osaf/libs/common/cpsv/include/cpnd_cb.h   |    3 +
>>> osaf/libs/common/cpsv/include/cpnd_init.h |    3 +
>>> osaf/libs/common/cpsv/include/cpsv_evt.h  |   20 ++++++++++
>>> osaf/services/saf/cpsv/cpd/Makefile.am    |    3 +-
>>> osaf/services/saf/cpsv/cpd/cpd_evt.c      |  229 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> osaf/services/saf/cpsv/cpd/cpd_imm.c      |  202 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> osaf/services/saf/cpsv/cpd/cpd_init.c     |   26 ++++++++++++-
>>> osaf/services/saf/cpsv/cpd/cpd_proc.c     |  309 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> osaf/services/saf/cpsv/cpd/cpd_tmr.c      |    7 +++
>>> osaf/services/saf/cpsv/cpnd/Makefile.am   |    6 ++-
>>> osaf/services/saf/cpsv/cpnd/cpnd_db.c     |   16 ++++++++
>>> osaf/services/saf/cpsv/cpnd/cpnd_evt.c    |   24 ++++++++++++
>>> osaf/services/saf/cpsv/cpnd/cpnd_init.c   |   34 ++++++++++++++++-
>>> osaf/services/saf/cpsv/cpnd/cpnd_mds.c    |   13 ++++++
>>> osaf/services/saf/cpsv/cpnd/cpnd_proc.c   |  429 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>> 21 files changed, 1423 insertions(+), 12 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> -
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> -
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> -
>>>
>>>
>>> Arch        Built    Started    Linux distro
>>> -------------------------------------------
>>> mips          n         n
>>> mips64        n         n
>>> x86           n         n
>>> x86_64        y         y
>>> powerpc       n         n
>>> powerpc64     n         n
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>> that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>> (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>> Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>> like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>> cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>> too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>> Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>> commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>> of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>> comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing the
>>> threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>> for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>> do not contain the patch that updates the Doxygen manual.
>>>
>>>
>>> _______________________________________________
>>> Opensaf-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>>
>>>
>>
>