Hi Hans,

     Changing rmem_default and rmem_max has no effect on the problem.  I 
even tried up to 2M to no avail.
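
     One way to confirm that a changed rmem limit is really in effect for newly created sockets is to request a receive buffer size and read back what the kernel granted. The sketch below does that on a throwaway UDP socket. It is only a stand-in: it does not touch the TIPC sockets that MDS opens, and the function name effective_rcvbuf is made up for illustration.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: ask for `requested` bytes of receive buffer on a
 * throwaway UDP socket and return what getsockopt() reports afterwards,
 * or -1 on error.  On Linux the request is capped by net.core.rmem_max
 * and the reported value is typically about double the granted size. */
int effective_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int granted = -1;
    socklen_t len = sizeof(granted);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        granted = -1;

    close(fd);
    return granted;
}
```

     Calling effective_rcvbuf(2 * 1024 * 1024) before and after changing rmem_max should show whether the larger limit is being honoured.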

     However, after looking at the cpnd_transfer_replica function in 
cpnd_evt.c, I found the following in cpsv_evt.h, which controls how large 
the packets sent through MDS can be:

#define MAX_SYNC_TRANSFER_SIZE           (30 * 1024 * 1024)

     30M?  What is the rationale for this number?  This seems way too 
high.  When I change it to (4*1024*1024) (4M) it solves my problem, and 
doesn't appear to affect performance.
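
     To put numbers on it: the test checkpoint is 10,000 sections of 1024 bytes, roughly 10 MB of payload, which fits under the 30M limit and so presumably goes out as one very large MDS message. The sketch below is just that arithmetic, assuming the transfer is cut into messages no larger than the limit and ignoring per-section headers. The function name sync_messages_needed is hypothetical and not taken from the cpsv code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of cpsv): number of messages needed to
 * move `total_bytes` of section data when each message carries at most
 * `max_transfer` bytes.  Payload only; headers are ignored. */
size_t sync_messages_needed(size_t total_bytes, size_t max_transfer)
{
    assert(max_transfer > 0);   /* a zero limit would never make progress */
    return (total_bytes + max_transfer - 1) / max_transfer;  /* ceiling division */
}
```

     For 10000 * 1024 bytes this gives 1 message with the 30M limit and 3 messages with the 4M limit, so each message MDS has to fragment and reassemble is much smaller.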

Alex

On 01/08/2014 08:30 AM, Hans Feldt wrote:
> sysctl -a | grep rmem
>
> set rmem_default to 256K or so
>
> /Hans
>
>> -----Original Message-----
>> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
>> Sent: den 8 januari 2014 14:01
>> To: A V Mahesh; Alex Jones
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] checkpoint problems
>>
>> The socket receive buffer size used is the system default. It can be too
>> small; increase it.
>> I plan to make some changes in MDS for this (and other things).
>> /Hans
>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>> Sent: den 8 januari 2014 11:29
>>> To: Alex Jones
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [devel] checkpoint problems
>>>
>>> Hi Alex,
>>>
>>> I suggest you try increasing the following TIPC value (in the TIPC code)
>>> and rebuilding `tipc.ko`:
>>>
>>> net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE      5000
>>>
>>> You can increase it to 50000 and try again.
>>>
>>> - AVM.
>>>
>>> On 1/8/2014 4:16 AM, Alex Jones wrote:
>>>> After doing some deep debugging I am seeing the following in the MDS
>>>> log on node B.  This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
>>>> sent from the active replica on node A to the replica on node B.  The
>>>> sync message never gets up to the CPND layer on node B because it is
>>>> dropped.
>>>>
>>>> This is with 10k sections, each section 1k.
>>>>
>>>> Jan  7 21:32:32.772347 <1789648919> ERR    |MDTM: Frag recd is not
>>>> next frag so dropping adest=<0x010010023922604c>
>>>> Jan  7 21:32:32.772399 <1789648919> ERR    |MDTM: Message is dropped
>>>> as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>>>>
>>>> I've turned on MDS debug on node B, and the packet being sent over is
>>>> gigantic.  It starts failing at fragment number 2703.  The next
>>>> fragment that comes in is 2707, then 2722.  The last fragment that
>>>> comes in is 7444.
>>>>
>>>> I've done a cursory look at the hardware stats, and nothing is being
>>>> rate-limited or dropped.
>>>>
>>>> I'm going to take a deeper look at this, but I'm mentioning it in case
>>>> it rings any bells.  I am using TIPC as the transport.
>>>>
>>>> Alex
>>>>
>>>> On 01/07/2014 07:24 AM, Alex Jones wrote:
>>>>> AVM,
>>>>>
>>>>>      I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
>>>>> timeout value.  Is this not a bug?  The synchronous CheckpointOpen
>>>>> call doesn't work at all in this scenario.  It never succeeds.
>>>>>
>>>>>      I can reproduce the problem with
>>>>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>>>>
>>>>>      You should be able to reproduce the problem with the code I sent
>>>>> in the last e-mail.
>>>>>
>>>>> Alex
>>>>>
>>>>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug; it
>>>>>> is expected if you pass a small timeout value (`timeout = 1000000000`)
>>>>>> to the saCkptCheckpointOpen(....,timeout ...) call when the ckpt has very
>>>>>> large data/sections. Just increasing the timeout will avoid the
>>>>>> SA_AIS_ERR_TIMEOUT.
>>>>>>
>>>>>> Let us focus on your original issue/scenario: are you able to
>>>>>> reproduce the problem with sectionCreationAttributes.expirationTime
>>>>>> set to SA_TIME_ONE_DAY?
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>>>>> AVM,
>>>>>>>
>>>>>>>      I've been playing around with your test program, and have
>>>>>>> gotten it to fail.
>>>>>>>
>>>>>>>      I made the following changes:
>>>>>>>
>>>>>>>   1. Change init_dataX to be 1024k bytes, so that you are
>>>>>>>      initializing the section to be 1024k.
>>>>>>>   2. Also, don't start the program on node B until A has finished
>>>>>>>      writing/creating all the sections.
>>>>>>>   3. Before hitting the enter key on node B, wait for the OpenAsync
>>>>>>>      call to finish.
>>>>>>>
>>>>>>>      You might notice the CheckpointOpen call failing now with
>>>>>>> SA_AIS_ERR_TIMEOUT.  I had to turn this into OpenAsync, and add a
>>>>>>> thread to process CkptDispatch messages.  This uncovers another bug
>>>>>>> in OpenAsync.  I've attached the mods to your program here.
>>>>>>>
>>>>>>>     The OpenAsync callback will be called twice, both times with
>>>>>>> error == SA_AIS_ERR_TIMEOUT.  If I call OpenAsync again when I get
>>>>>>> this error, the next callback returns success, but the callback
>>>>>>> gets called twice with success and with two different checkpoint
>>>>>>> handles!
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> I have created 10K sections (please find the attached test
>>>>>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>>>>>> with your specified scenario & configuration, and I haven't observed any
>>>>>>>> issue with the sections on the other node.
>>>>>>>>
>>>>>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>>>>>
>>>>>>>> One more important point: what value did you configure for
>>>>>>>> `sectionCreationAttributes.expirationTime`?
>>>>>>>> I configured SA_TIME_ONE_DAY.
>>>>>>>>
>>>>>>>> Steps to run the application:
>>>>>>>>
>>>>>>>>
>>>>>>>> ======================================================================
>>>>>>>> Compile :
>>>>>>>>
>>>>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>>>>
>>>>>>>>
>>>>>>>> Run :
>>>>>>>>
>>>>>>>> 1) saCkptCheckpointOpen On node A
>>>>>>>>
>>>>>>>> NODE-A# ./checkpoint_A
>>>>>>>>
>>>>>>>> CPSV:CPA:ONsaCkptSectionCreate  Waiting to Create Sections
>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>>>>
>>>>>>>> .
>>>>>>>> 2) saCkptCheckpointOpen() same ckpt On node B
>>>>>>>>
>>>>>>>> NODE-B# ./checkpoint_B
>>>>>>>>
>>>>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>>>>>> key to continue...
>>>>>>>>
>>>>>>>>
>>>>>>>> 3) saCkptSectionCreate() on node A and read saCkptCheckpointStatusGet()
>>>>>>>>
>>>>>>>> NODE-A#
>>>>>>>>     checkpointStatus.numberOfSections : 10000
>>>>>>>>     checkpointStatus.memoryUsed :756000
>>>>>>>>      checkpointCreationAttributes.creationFlags;10
>>>>>>>>     checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>     checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>     checkpointCreationAttributes.maxSections;10000
>>>>>>>>     checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>     checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>     ================================
>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>>>>> <Enter> key to continue...
>>>>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>>>>
>>>>>>>>
>>>>>>>> 4) saCkptActiveReplicaSet() on node B and saCkptCheckpointStatusGet()
>>>>>>>>
>>>>>>>> NODE-B#
>>>>>>>>     checkpointStatus.numberOfSections : 10000
>>>>>>>>     checkpointStatus.memoryUsed :756000
>>>>>>>>      checkpointCreationAttributes.creationFlags;10
>>>>>>>>     checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>     checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>     checkpointCreationAttributes.maxSections;10000
>>>>>>>>     checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>     checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>
>>>>>>>>     saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>>>>> <Enter> key to continue...
>>>>>>>>     saCkptCheckpoint Press <Enter> key to continue..
>>>>>>>>
>>>>>>>>
>>>>>>>> ======================================================================
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>>>>> Hi Alex,
>>>>>>>>>
>>>>>>>>> We have never tested with 7500 sections. We will test and let you know.
>>>>>>>>> Can you please share your test application? That would allow us to
>>>>>>>>> respond quickly.
>>>>>>>>>
>>>>>>>>> -AVM
>>>>>>>>>
>>>>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>>>>> Hello All,
>>>>>>>>>>
>>>>>>>>>>        I'm experimenting with the checkpoint service, and some things
>>>>>>>>>> don't appear to work.
>>>>>>>>>>
>>>>>>>>>>        The saCkptActiveReplicaSet and
>>>>>>>>>> saCkptCheckpointSynchronize[Async] calls don't appear to work when the
>>>>>>>>>> checkpoint has more than around 5500 sections.
>>>>>>>>>>
>>>>>>>>>>        I've created a checkpoint with 7500 sections, each section being
>>>>>>>>>> 1024 bytes.  The checkpoint is co-located and the "active replica"
>>>>>>>>>> bit is set.
>>>>>>>>>>
>>>>>>>>>>        I can create and write all the sections.  And from another node
>>>>>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>>>>>> Everything is there.  I see no errors from any CKPT API calls.
>>>>>>>>>>
>>>>>>>>>>        The problem comes when I call saCkptActiveReplicaSet from this
>>>>>>>>>> other node.  After I do this, saCkptCheckpointStatusGet now returns
>>>>>>>>>> all the same information except the number of sections is no longer
>>>>>>>>>> 7500 but 0.  If I do this test with 50,000 sections only about 3,000
>>>>>>>>>> entries get synced.  And iterating through the sections shows that
>>>>>>>>>> there are only 3,000 sections.
>>>>>>>>>>
>>>>>>>>>>        Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>>>>>> no effect, either.
>>>>>>>>>>
>>>>>>>>>>        After looking through the code I see a comment in
>>>>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>>>>>> with old active if now this fellow is becoming active. */"  So, it
>>>>>>>>>> doesn't appear that syncing is being done in the
>>>>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>>>>
>>>>>>>>>>        Can someone comment?
>>>>>>>>>>
>>>>>>>>>>        I'm going to fix this and post a patch unless someone else is
>>>>>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Rapidly troubleshoot problems before they affect your business. Most 
>>>>>>>>>> IT
>>>>>>>>>> organizations don't have a clear picture of how application 
>>>>>>>>>> performance
>>>>>>>>>> affects their revenue. With AppDynamics, you get 100% visibility into
>>>>>>>>>> your
>>>>>>>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>>>>>>>>> AppDynamics Pro!
>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Opensaf-devel mailing list
>>>>>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel


