Hi Alex,

On 10/5/2015 9:40 PM, Alex Jones wrote:
> I do want to look into why the sync is timing out.

As I understand it, you are seeing the issue when a collocated checkpoint
replica is opened while the active replica has a large number of sections
(~200k), each approximately 2k in size.

So, for debugging/assessing the issue, first tune the cpsv sync timeout
values below. If increasing these values resolves the issue, then we can
think about an alternative solution, something like dynamically calculating
the sync timeout value for the non-active collocated checkpoint replica
being opened.

=================================================================================================================
osaf/libs/common/cpsv/include/cpa_def.h:#define CPSV_WAIT_TIME 1400 /* MDS wait time in case of syncronous call */
osaf/libs/common/cpsv/include/cpnd_cb.h:#define CPSV_WAIT_TIME 1000
osaf/libs/common/cpsv/include/cpd_cb.h:#define CPSV_WAIT_TIME 1000
=================================================================================================================

-AVM

On 10/5/2015 9:40 PM, Alex Jones wrote:
> Hi AVM,
>
> The size of the sections is approximately 2k.
>
> I don't have a test program that exhibits this. This was captured in our
> product.
>
> I do want to look into why the sync is timing out. I may create another
> bug for that depending on what I find.
>
> Alex
>
> ________________________________________
> From: A V Mahesh [[email protected]]
> Sent: Monday, October 05, 2015 4:35 AM
> To: Alex Jones
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for CKPT: fix crash in cpnd when
> checkpoint open sync to active times out [#1510]
>
> Hi Alex,
>
> On 10/1/2015 10:34 PM, Alex Jones wrote:
>> When a collocated checkpoint replica is opened, and the active replica has
>> large numbers of sections (~200k),
> can you please share the size of each sections .
>
> -AVM
>
>
> On 10/5/2015 9:31 AM, A V Mahesh wrote:
>> Hi Alex,
>>
>> If you have ready to use test application can you please attach.
>>
>> -AVM
>>
>> On 10/1/2015 10:34 PM, Alex Jones wrote:
>>> Summary: CKPT: fix crash in cpnd when opening replica times out [#1510]
>>> Review request for Trac Ticket(s): 1510
>>> Peer Reviewer(s): AVM
>>> Pull request to: AVM
>>> Affected branch(es): default, 4.7, 4.6, 4.5
>>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>> Docs                    n
>>> Build system            n
>>> RPM/packaging           n
>>> Configuration files     n
>>> Startup scripts         n
>>> SAF services            y
>>> OpenSAF services        n
>>> Core libraries          n
>>> Samples                 n
>>> Tests                   n
>>> Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>> <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
>>>
>>> changeset 923566e6c96312c15330b4e8ed0c81a80a2701f0
>>> Author: Alex Jones <[email protected]>
>>> Date: Thu, 01 Oct 2015 12:56:53 -0400
>>>
>>> ckptnd: fix crash when checkpoint open sync to active times out [#1510]
>>>
>>> ckptnd core dumps with many different stack traces
>>>
>>> When a collocated checkpoint replica is opened, and the active replica has
>>> large numbers of sections (~200k), the sync from the active to the replica
>>> can timeout. If the MDS sync succeeds, but the error code in the out_evt is
>>> not SA_AIS_OK, the current code jumps to the ckpt_shm_node_free_error label.
>>> The code under this label assumes that the node was not successfully created
>>> in the database, so doesn't remove it. But in this case it was created. The
>>> node memory is freed, but the node is not removed from the database. The
>>> next time this checkpoint is accessed, cpnd will access freed memory and
>>> crash.
>>>
>>> Set a flag after the node has been added to the database. And in the
>>> ckpt_node_free_error label, remove the node from the database if it was
>>> added.
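
For readers following the thread, here is a minimal sketch of the cleanup pattern the commit message above describes: record in a flag that the node reached the database, and undo the insert in the error path before freeing the node. This is not the code from cpnd_evt.c. Every name in it (ckpt_node_t, ckpt_db_t, db_add, db_remove, db_find, open_replica, the sync_rc parameter) is hypothetical, and the fixed-size array only stands in for the real cpnd database.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DB_SLOTS 8 /* toy capacity, an assumption for this sketch only */

/* Hypothetical stand-in for a checkpoint node. */
typedef struct ckpt_node {
    unsigned id;
} ckpt_node_t;

/* Hypothetical stand-in for the cpnd checkpoint database. */
typedef struct ckpt_db {
    ckpt_node_t *slots[DB_SLOTS];
} ckpt_db_t;

/* Store the node in the first free slot; returns false if the db is full. */
bool db_add(ckpt_db_t *db, ckpt_node_t *node)
{
    assert(db != NULL && node != NULL);
    for (size_t i = 0; i < DB_SLOTS; i++) {
        if (db->slots[i] == NULL) {
            db->slots[i] = node;
            return true;
        }
    }
    return false;
}

/* Drop every reference to the node from the database. */
void db_remove(ckpt_db_t *db, const ckpt_node_t *node)
{
    for (size_t i = 0; i < DB_SLOTS; i++) {
        if (db->slots[i] == node)
            db->slots[i] = NULL;
    }
}

/* Look a node up by id; returns NULL if it is not in the database. */
ckpt_node_t *db_find(const ckpt_db_t *db, unsigned id)
{
    for (size_t i = 0; i < DB_SLOTS; i++) {
        if (db->slots[i] != NULL && db->slots[i]->id == id)
            return db->slots[i];
    }
    return NULL;
}

/*
 * Open a replica. sync_rc simulates the error code reported by the active
 * replica after the sync call itself returned (0 means success).
 * Returns 0 on success, -1 on failure.
 */
int open_replica(ckpt_db_t *db, unsigned id, int sync_rc)
{
    bool added_to_db = false;
    ckpt_node_t *node = calloc(1, sizeof(*node));

    if (node == NULL)
        return -1;
    node->id = id;

    if (!db_add(db, node))
        goto node_free_error;
    added_to_db = true; /* from here on, cleanup must also undo the insert */

    if (sync_rc != 0)
        goto node_free_error;

    return 0;

node_free_error:
    if (added_to_db)
        db_remove(db, node); /* without this, the db keeps a dangling pointer */
    free(node);
    return -1;
}
```

In this sketch, a failed open_replica() leaves nothing behind for db_find() to return, so a later access to the same checkpoint id cannot touch freed memory. Dropping the added_to_db check reproduces the situation from the commit message: the node is freed while the database still points at it.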
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>> osaf/services/saf/cpsv/cpnd/cpnd_evt.c | 10 ++++++++++
>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> 1) create a collocated checkpoint with 200k sections, and continue
>>> updating the sections
>>> 2) open the same checkpoint on another node (this creates a replica)
>>>
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> 1) cpnd on the replica node should not crash, and sync should succeed
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>> <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>>
>>>
>>> Arch        Built   Started   Linux distro
>>> -------------------------------------------
>>> mips          n        n
>>> mips64        n        n
>>> x86           n        n
>>> x86_64        y        y
>>> powerpc       n        n
>>> powerpc64     n        n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank entries
>>>     that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>     (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>     Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>     like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>     cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>     too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>     Instead you should place your content in a public tree to be pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>     commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear indication
>>>     of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>     comments and change requests that were proposed in the initial review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer have a badly configured date and time; confusing the
>>>     the threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>     for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>     do not contain the patch that updates the Doxygen manual.
>>>
>
>

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
