Hi Hans,

Changing rmem_default and rmem_max has no effect on the problem. I even tried up to 2M, to no avail.
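For anyone repeating this test, here is a minimal sketch of how one might confirm which receive buffer size a socket really ends up with after the sysctl values are changed. It assumes Linux and a plain UDP socket rather than the TIPC socket MDS actually opens, and the helper names are hypothetical, not part of OpenSAF.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an OpenSAF function): request a receive buffer of
 * `requested` bytes on `fd` and return the size the kernel reports afterwards,
 * or -1 on error.
 */
static int set_and_get_rcvbuf(int fd, int requested)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    assert(requested > 0);

    /* Ask for the new size; Linux caps the request at net.core.rmem_max. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0)
        return -1;

    /* Read back the effective size (Linux reports double the stored value). */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) != 0)
        return -1;

    return granted;
}

/* Open a throwaway UDP socket, resize its receive buffer, report the result. */
static int probe_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int granted;

    if (fd < 0)
        return -1;

    granted = set_and_get_rcvbuf(fd, requested);
    close(fd);
    return granted;
}
```

Calling probe_rcvbuf(2 * 1024 * 1024) before and after raising rmem_max shows whether a 2M request is honoured or silently capped. It says nothing about sockets that never call setsockopt; those simply get rmem_default.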
However, after looking at the cpnd_transfer_replica function in cpnd_evt.c, I found the following in cpsv_evt.h, which controls how large the packets sent through MDS are:

#define MAX_SYNC_TRANSFER_SIZE (30 * 1024 * 1024)

30M? What is the rationale for this number? This seems way too high. When I change it to (4 * 1024 * 1024) (4M), it solves my problem and doesn't appear to affect performance.

Alex

On 01/08/2014 08:30 AM, Hans Feldt wrote:
> sysctl -a | grep rmem
>
> set rmem_default to 256K or so
>
> /Hans
>
>> -----Original Message-----
>> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
>> Sent: 8 January 2014 14:01
>> To: A V Mahesh; Alex Jones
>> Cc: opensaf-devel@lists.sourceforge.net
>> Subject: Re: [devel] checkpoint problems
>>
>> The socket receive buffer size used is the system default. It can be too
>> small, pump it up.
>> I plan to do some change in MDS for this (and other stuff).
>> /Hans
>>
>>> -----Original Message-----
>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>> Sent: 8 January 2014 11:29
>>> To: Alex Jones
>>> Cc: opensaf-devel@lists.sourceforge.net
>>> Subject: Re: [devel] checkpoint problems
>>>
>>> Hi Alex,
>>>
>>> I suggest you increase the following TIPC value (in the tipc code),
>>> rebuild `tipc.ko`, and try again:
>>>
>>> net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE 5000
>>>
>>> You can increase it to 50000 and try again.
>>>
>>> - AVM.
>>>
>>> On 1/8/2014 4:16 AM, Alex Jones wrote:
>>>> After doing some deep debugging I am seeing the following in the MDS
>>>> log on node B. This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
>>>> sent from the active replica on node A to the replica on node B. The
>>>> sync message never gets up to the CPND layer on node B because it is
>>>> dropped.
>>>>
>>>> This is with 10k sections, each section 1k.
>>>>
>>>> Jan 7 21:32:32.772347 <1789648919> ERR |MDTM: Frag recd is not
>>>> next frag so dropping adest=<0x010010023922604c>
>>>> Jan 7 21:32:32.772399 <1789648919> ERR |MDTM: Message is dropped
>>>> as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>>>>
>>>> I've turned on MDS debug on node B, and the packet being sent over is
>>>> gigantic. It starts failing at fragment number 2703. The next
>>>> fragment that comes in is 2707, then 2722. The last fragment that
>>>> comes in is 7444.
>>>>
>>>> I've done a cursory look at the hardware stats, and nothing is being
>>>> rate-limited or dropped.
>>>>
>>>> I'm going to take a deeper look at this, but I'm mentioning it in case
>>>> it rings any bells. I am using TIPC as the transport.
>>>>
>>>> Alex
>>>>
>>>> On 01/07/2014 07:24 AM, Alex Jones wrote:
>>>>> AVM,
>>>>>
>>>>> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
>>>>> timeout value. Is this not a bug? The synchronous CheckpointOpen
>>>>> call doesn't work at all in this scenario. It never succeeds.
>>>>>
>>>>> I can reproduce the problem with
>>>>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>>>>
>>>>> You should be able to reproduce the problem with the code I sent
>>>>> in the last e-mail.
>>>>>
>>>>> Alex
>>>>>
>>>>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug;
>>>>>> it is expected if you pass too small a timeout value
>>>>>> (`timeout = 1000000000`) to the saCkptCheckpointOpen(....,timeout ...)
>>>>>> call when the ckpt has very large data/sections. Just increasing the
>>>>>> timeout will avoid the SA_AIS_ERR_TIMEOUT.
>>>>>>
>>>>>> Let us focus on your original issue/scenario: are you able to
>>>>>> reproduce the problem with sectionCreationAttributes.expirationTime
>>>>>> set to SA_TIME_ONE_DAY?
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>>>>> AVM,
>>>>>>>
>>>>>>> I've been playing around with your test program, and have
>>>>>>> gotten it to fail.
>>>>>>>
>>>>>>> I made the following changes:
>>>>>>>
>>>>>>> 1. Change init_dataX to be 1024k bytes, so that you are
>>>>>>> initializing the section to be 1024k.
>>>>>>> 2. Also, don't start the program on node B until A has finished
>>>>>>> writing/creating all the sections.
>>>>>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
>>>>>>> call to finish.
>>>>>>>
>>>>>>> You might notice the CheckpointOpen call failing now with
>>>>>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
>>>>>>> thread to process CkptDispatch messages. This uncovers another bug
>>>>>>> in OpenAsync. I've attached the mods to your program here.
>>>>>>>
>>>>>>> The OpenAsync callback will be called twice, both times with
>>>>>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
>>>>>>> this error, the next callback returns success, but the callback
>>>>>>> gets called twice with success and with two different checkpoint
>>>>>>> handles!
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> I have created 10K sections (please find the attached test
>>>>>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>>>>>> with your specified scenario & configuration, and I haven't observed
>>>>>>>> any issue with sections on another node.
>>>>>>>>
>>>>>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>>>>>
>>>>>>>> One more important point: what value did you configure for
>>>>>>>> `sectionCreationAttributes.expirationTime`?
>>>>>>>> I configured SA_TIME_ONE_DAY.
>>>>>>>>
>>>>>>>> Steps to run the application:
>>>>>>>>
>>>>>>>> ======================================================================
>>>>>>>> Compile :
>>>>>>>>
>>>>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>>>>
>>>>>>>>
>>>>>>>> Run :
>>>>>>>>
>>>>>>>> 1) saCkptCheckpointOpen() on node A
>>>>>>>>
>>>>>>>> NODE-A# ./checkpoint_A
>>>>>>>>
>>>>>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>>>>
>>>>>>>>
>>>>>>>> 2) saCkptCheckpointOpen() of the same ckpt on node B
>>>>>>>>
>>>>>>>> NODE-B# ./checkpoint_B
>>>>>>>>
>>>>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>>>>>> key to continue...
>>>>>>>>
>>>>>>>>
>>>>>>>> 3) saCkptSectionCreate() on node A, then read
>>>>>>>> saCkptCheckpointStatusGet()
>>>>>>>>
>>>>>>>> NODE-A#
>>>>>>>> checkpointStatus.numberOfSections : 10000
>>>>>>>> checkpointStatus.memoryUsed :756000
>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>> ================================
>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>>>>> <Enter> key to continue...
>>>>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>>>>
>>>>>>>>
>>>>>>>> 4) saCkptActiveReplicaSet() on node B, then read
>>>>>>>> saCkptCheckpointStatusGet()
>>>>>>>>
>>>>>>>> NODE-B#
>>>>>>>> checkpointStatus.numberOfSections : 10000
>>>>>>>> checkpointStatus.memoryUsed :756000
>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>
>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>>>>> <Enter> key to continue...
>>>>>>>> saCkptCheckpoint Press <Enter> key to continue..
>>>>>>>>
>>>>>>>> ======================================================================
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>>>>> Hi Alex,
>>>>>>>>>
>>>>>>>>> We never tested 7500 sections; we will test and let you know.
>>>>>>>>> Can you please share your test application? That would allow us
>>>>>>>>> to respond quickly.
>>>>>>>>>
>>>>>>>>> -AVM
>>>>>>>>>
>>>>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>>>>> Hello All,
>>>>>>>>>>
>>>>>>>>>> I'm experimenting with the checkpoint service, and some things
>>>>>>>>>> don't appear to work.
>>>>>>>>>>
>>>>>>>>>> The saCkptActiveReplicaSet and
>>>>>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
>>>>>>>>>> checkpoint has section numbers greater than around 5500.
>>>>>>>>>>
>>>>>>>>>> I've created a checkpoint with 7500 sections, each section being
>>>>>>>>>> 1024 bytes. The checkpoint is co-located and the "active replica"
>>>>>>>>>> bit is set.
>>>>>>>>>>
>>>>>>>>>> I can create and write all the sections.
>>>>>>>>>> And from another node
>>>>>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>>>>>> Everything is there. I see no errors from any CKPT API calls.
>>>>>>>>>>
>>>>>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
>>>>>>>>>> other node. After I do this, saCkptCheckpointStatusGet now returns
>>>>>>>>>> all the same information except the number of sections is no longer
>>>>>>>>>> 7500 but 0. If I do this test with 50,000 sections only about 3,000
>>>>>>>>>> entries get synced. And iterating through the sections shows that
>>>>>>>>>> there are only 3,000 sections.
>>>>>>>>>>
>>>>>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>>>>>> no effect, either.
>>>>>>>>>>
>>>>>>>>>> After looking through the code I see a comment in
>>>>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>>>>>> with old active if now this fellow is becoming active. */" So, it
>>>>>>>>>> doesn't appear that syncing is being done in the
>>>>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>>>>
>>>>>>>>>> Can someone comment?
>>>>>>>>>>
>>>>>>>>>> I'm going to fix this and post a patch unless someone else is
>>>>>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>>>>>
>>>>>>>>>> Alex
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>> Rapidly troubleshoot problems before they affect your business. Most IT
>>>>>>>>>> organizations don't have a clear picture of how application performance
>>>>>>>>>> affects their revenue. With AppDynamics, you get 100% visibility into your
>>>>>>>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>>>>>>>>> AppDynamics Pro!
>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Opensaf-devel mailing list
>>>>>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel