Hi Alex,

On 10/5/2015 9:40 PM, Alex Jones wrote:
> I do want to look into why the sync is timing out.

As I understand it, you are seeing the issue when a collocated checkpoint
replica is opened and the active replica has a large number of sections
(~200k), each approximately 2k in size.


So, for debugging/assessing the issue, first tune the cpsv sync timeout
variables below. If increasing these values resolves the issue, then we
can think of an alternative solution, something like dynamically
calculating the sync timeout value when a non-active collocated
checkpoint replica is opened.

=================================================================================================================
osaf/libs/common/cpsv/include/cpa_def.h:#define CPSV_WAIT_TIME 1400    /* MDS wait time in case of syncronous call */
osaf/libs/common/cpsv/include/cpnd_cb.h:#define CPSV_WAIT_TIME  1000
osaf/libs/common/cpsv/include/cpd_cb.h:#define CPSV_WAIT_TIME  1000
=================================================================================================================
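
As a rough illustration of the dynamic calculation idea, something like
the sketch below could scale the wait time with the amount of data to
sync. The helper name sync_wait_time, the SYNC_BYTES_PER_TICK rate and
the SYNC_WAIT_TIME_MAX cap are hypothetical and are not existing cpsv
code; the result is assumed to be in the same units as CPSV_WAIT_TIME,
and the numbers would need tuning against real measurements.

```c
#include <stdint.h>

/* Base wait time, as quoted above for cpnd_cb.h. */
#define CPSV_WAIT_TIME 1000

/* Hypothetical tuning constants (assumptions, not from the cpsv sources). */
#define SYNC_BYTES_PER_TICK (256u * 1024u) /* data assumed synced per time unit */
#define SYNC_WAIT_TIME_MAX  6000u          /* upper bound on the wait time */

/*
 * Sketch: derive a sync wait time from the size of the active replica.
 * The base CPSV_WAIT_TIME is extended by one unit per SYNC_BYTES_PER_TICK
 * bytes of section data, and the result is capped at SYNC_WAIT_TIME_MAX.
 */
static uint32_t sync_wait_time(uint64_t num_sections, uint64_t section_size)
{
	uint64_t total_bytes = num_sections * section_size;
	uint64_t wait = CPSV_WAIT_TIME + total_bytes / SYNC_BYTES_PER_TICK;

	if (wait > SYNC_WAIT_TIME_MAX)
		wait = SYNC_WAIT_TIME_MAX;

	return (uint32_t)wait;
}
```

With these assumed constants, 200k sections of 2k each extend the base
wait time rather than relying on the fixed 1000; a small checkpoint
keeps the base value.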

-AVM

On 10/5/2015 9:40 PM, Alex Jones wrote:
> Hi AVM,
>
>    The size of the sections is approximately 2k.
>
>    I don't have a test program that exhibits this.  This was captured in our 
> product.
>
>    I do want to look into why the sync is timing out.  I may create another 
> bug for that depending on what I find.
>
> Alex
>
> ________________________________________
> From: A V Mahesh [[email protected]]
> Sent: Monday, October 05, 2015 4:35 AM
> To: Alex Jones
> Cc: [email protected]
> Subject: Re: [PATCH 0 of 1] Review Request for CKPT: fix crash in cpnd when 
> checkpoint open sync to active times out [#1510]
>
> Hi Alex,
>
> On 10/1/2015 10:34 PM, Alex Jones wrote:
>> When a collocated checkpoint replica is opened, and the active replica
>> has
>>      large numbers of sections (~200k),
> Can you please share the size of each section?
>
> -AVM
>
>
> On 10/5/2015 9:31 AM, A V Mahesh wrote:
>> Hi Alex,
>>
>> If you have a ready-to-use test application, can you please attach it?
>>
>> -AVM
>>
>> On 10/1/2015 10:34 PM, Alex Jones wrote:
>>> Summary: CKPT: fix crash in cpnd when opening replica times out [#1510]
>>> Review request for Trac Ticket(s): 1510
>>> Peer Reviewer(s): AVM
>>> Pull request to: AVM
>>> Affected branch(es): default, 4.7, 4.6, 4.5
>>> Development branch: <<IF ANY GIVE THE REPO URL>>
>>>
>>> --------------------------------
>>> Impacted area       Impact y/n
>>> --------------------------------
>>>    Docs                    n
>>>    Build system            n
>>>    RPM/packaging           n
>>>    Configuration files     n
>>>    Startup scripts         n
>>>    SAF services            y
>>>    OpenSAF services        n
>>>    Core libraries          n
>>>    Samples                 n
>>>    Tests                   n
>>>    Other                   n
>>>
>>>
>>> Comments (indicate scope for each "y" above):
>>> ---------------------------------------------
>>>    <<EXPLAIN/COMMENT THE PATCH SERIES HERE>>
>>>
>>> changeset 923566e6c96312c15330b4e8ed0c81a80a2701f0
>>> Author:    Alex Jones <[email protected]>
>>> Date:    Thu, 01 Oct 2015 12:56:53 -0400
>>>
>>>      ckptnd: fix crash when checkpoint open sync to active times out
>>> [#1510]
>>>
>>>      ckptnd core dumps with many different stack traces
>>>
>>>      When a collocated checkpoint replica is opened, and the active
>>> replica has
>>>      large numbers of sections (~200k), the sync from the active to
>>> the replica
>>>      can timeout. If the MDS sync succeeds, but the error code in the
>>> out_evt is
>>>      not SA_AIS_OK, the current code jumps to the
>>> ckpt_shm_node_free_error label.
>>>      The code under this label assumes that the node was not
>>> successfully created
>>>      in the database, so doesn't remove it. But in this case it was
>>> created. The
>>>      node memory is freed, but the node is not removed from the
>>> database. The
>>>      next time this checkpoint is accessed, cpnd will access freed
>>> memory and
>>>      crash.
>>>
>>>      Set a flag after the node has been added to the database. And in the
>>>      ckpt_node_free_error label, remove the node from the database if
>>> it was
>>>      added.
>>>
>>>
>>> Complete diffstat:
>>> ------------------
>>>    osaf/services/saf/cpsv/cpnd/cpnd_evt.c |  10 ++++++++++
>>>    1 files changed, 10 insertions(+), 0 deletions(-)
>>>
>>>
>>> Testing Commands:
>>> -----------------
>>> 1) create a collocated checkpoint with 200k sections, and continue
>>> updating the
>>>      sections
>>> 2) open the same checkpoint on another node (this creates a replica)
>>>
>>>
>>> Testing, Expected Results:
>>> --------------------------
>>> 1) cpnd on the replica node should not crash, and sync should succeed
>>>
>>>
>>> Conditions of Submission:
>>> -------------------------
>>>    <<HOW MANY DAYS BEFORE PUSHING, CONSENSUS ETC>>
>>>
>>>
>>> Arch      Built     Started    Linux distro
>>> -------------------------------------------
>>> mips        n          n
>>> mips64      n          n
>>> x86         n          n
>>> x86_64      y          y
>>> powerpc     n          n
>>> powerpc64   n          n
>>>
>>>
>>> Reviewer Checklist:
>>> -------------------
>>> [Submitters: make sure that your review doesn't trigger any checkmarks!]
>>>
>>>
>>> Your checkin has not passed review because (see checked entries):
>>>
>>> ___ Your RR template is generally incomplete; it has too many blank
>>> entries
>>>       that need proper data filled in.
>>>
>>> ___ You have failed to nominate the proper persons for review and push.
>>>
>>> ___ Your patches do not have proper short+long header
>>>
>>> ___ You have grammar/spelling in your header that is unacceptable.
>>>
>>> ___ You have exceeded a sensible line length in your
>>> headers/comments/text.
>>>
>>> ___ You have failed to put in a proper Trac Ticket # into your commits.
>>>
>>> ___ You have incorrectly put/left internal data in your comments/files
>>>       (i.e. internal bug tracking tool IDs, product names etc)
>>>
>>> ___ You have not given any evidence of testing beyond basic build tests.
>>>       Demonstrate some level of runtime or other sanity testing.
>>>
>>> ___ You have ^M present in some of your files. These have to be removed.
>>>
>>> ___ You have needlessly changed whitespace or added whitespace crimes
>>>       like trailing spaces, or spaces before tabs.
>>>
>>> ___ You have mixed real technical changes with whitespace and other
>>>       cosmetic code cleanup changes. These have to be separate commits.
>>>
>>> ___ You need to refactor your submission into logical chunks; there is
>>>       too much content into a single commit.
>>>
>>> ___ You have extraneous garbage in your review (merge commits etc)
>>>
>>> ___ You have giant attachments which should never have been sent;
>>>       Instead you should place your content in a public tree to be
>>> pulled.
>>>
>>> ___ You have too many commits attached to an e-mail; resend as threaded
>>>       commits, or place in a public tree for a pull.
>>>
>>> ___ You have resent this content multiple times without a clear
>>> indication
>>>       of what has changed between each re-send.
>>>
>>> ___ You have failed to adequately and individually address all of the
>>>       comments and change requests that were proposed in the initial
>>> review.
>>>
>>> ___ You have a misconfigured ~/.hgrc file (i.e. username, email etc)
>>>
>>> ___ Your computer has a badly configured date and time, confusing the
>>>       threaded patch review.
>>>
>>> ___ Your changes affect IPC mechanism, and you don't present any results
>>>       for in-service upgradability test.
>>>
>>> ___ Your changes affect user manual and documentation, your patch series
>>>       do not contain the patch that updates the Doxygen manual.
>>>
>
>

