The socket receive buffer size used is the system default. It can be too small, so try increasing it. I plan to make some changes in MDS for this (and other stuff).

/Hans

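A minimal sketch of raising the receive buffer on a socket, assuming a Linux host. `get_rcvbuf` and `set_rcvbuf` are hypothetical helper names, not MDS functions, and this is not the planned MDS change. `SO_RCVBUF` is set the same way at the socket level for any address family; whether a given TIPC version honors it is not something this sketch shows. On Linux the kernel doubles the requested value for bookkeeping and caps the request at `net.core.rmem_max`, so the code reads the value back instead of assuming the request was honored.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the receive buffer size currently in
 * effect on fd, or -1 on error. */
static int get_rcvbuf(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, &len) < 0) {
        perror("getsockopt(SO_RCVBUF)");
        return -1;
    }
    return size;
}

/* Hypothetical helper: request a receive buffer of `requested` bytes on fd
 * and return the size the kernel actually applied, or -1 on error. */
static int set_rcvbuf(int fd, int requested)
{
    assert(requested > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0) {
        perror("setsockopt(SO_RCVBUF)");
        return -1;
    }
    /* Read the value back: the kernel may have doubled or capped it. */
    return get_rcvbuf(fd);
}
```

For example, `int fd = socket(AF_INET, SOCK_DGRAM, 0); set_rcvbuf(fd, 1 << 20);` requests 1 MiB, and the return value shows what the kernel actually granted. If the result is lower than expected, the system-wide limit (`net.core.rmem_max`) is the first thing to check.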
> -----Original Message-----
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: 8 January 2014 11:29
> To: Alex Jones
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: Re: [devel] checkpoint problems
>
> Hi Alex,
>
> I suggest you increase the following TIPC value (in the TIPC code)
> and rebuild `tipc.ko`:
>
> net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE 5000
>
> You can increase it to 50000 and try again.
>
> - AVM.
>
> On 1/8/2014 4:16 AM, Alex Jones wrote:
> > After doing some deep debugging I am seeing the following in the MDS
> > log on node B. This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
> > sent from the active replica on node A to the replica on node B. The
> > sync message never gets up to the CPND layer on node B because it is
> > dropped.
> >
> > This is with 10k sections, each section 1k.
> >
> > Jan 7 21:32:32.772347 <1789648919> ERR |MDTM: Frag recd is not next frag so dropping adest=<0x010010023922604c>
> > Jan 7 21:32:32.772399 <1789648919> ERR |MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
> >
> > I've turned on MDS debug on node B, and the packet being sent over is
> > gigantic. It starts failing at fragment number 2703. The next
> > fragment that comes in is 2707, then 2722. The last fragment that
> > comes in is 7444.
> >
> > I've done a cursory look at the hardware stats, and nothing is being
> > rate-limited or dropped.
> >
> > I'm going to take a deeper look at this, but I'm mentioning it in case
> > it rings any bells. I am using TIPC as the transport.
> >
> > Alex
> >
> > On 01/07/2014 07:24 AM, Alex Jones wrote:
> >> AVM,
> >>
> >> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
> >> timeout value. Is this not a bug? The synchronous CheckpointOpen
> >> call doesn't work at all in this scenario. It never succeeds.
> >>
> >> I can reproduce the problem with
> >> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
> >>
> >> You should be able to reproduce the problem with the code I sent
> >> in the last e-mail.
> >>
> >> Alex
> >>
> >> On 01/06/2014 10:31 PM, A V Mahesh wrote:
> >>> Hi Alex,
> >>>
> >>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug; it
> >>> is expected if you pass a small timeout value (`timeout = 1000000000`)
> >>> to the saCkptCheckpointOpen(....,timeout ...) call when the ckpt has very
> >>> large data/sections. Just increasing the timeout will avoid the
> >>> SA_AIS_ERR_TIMEOUT.
> >>>
> >>> Let us focus on your original issue/scenario: are you able to
> >>> reproduce the problem with sectionCreationAttributes.expirationTime
> >>> set to SA_TIME_ONE_DAY?
> >>>
> >>> -AVM
> >>>
> >>> On 1/7/2014 1:17 AM, Alex Jones wrote:
> >>>> AVM,
> >>>>
> >>>> I've been playing around with your test program, and have
> >>>> gotten it to fail.
> >>>>
> >>>> I made the following changes:
> >>>>
> >>>> 1. Change init_dataX to be 1024k bytes, so that you are
> >>>>    initializing the section to be 1024k.
> >>>> 2. Also, don't start the program on node B until A has finished
> >>>>    writing/creating all the sections.
> >>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
> >>>>    call to finish.
> >>>>
> >>>> You might notice the CheckpointOpen call failing now with
> >>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
> >>>> thread to process CkptDispatch messages. This uncovers another bug
> >>>> in OpenAsync. I've attached the mods to your program here.
> >>>>
> >>>> The OpenAsync callback will be called twice, both times with
> >>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
> >>>> this error, the next callback returns success, but the callback
> >>>> gets called twice with success and with two different checkpoint
> >>>> handles!
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
> >>>>> Hi Alex,
> >>>>>
> >>>>> I have created 10K sections (please find the attached test
> >>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
> >>>>> with your specified scenario & configuration, and I haven't observed any
> >>>>> issue with sections on another node.
> >>>>>
> >>>>> Try to reproduce the problem on your setup & let me know the result.
> >>>>>
> >>>>> One more important point: what value did you configure for
> >>>>> `sectionCreationAttributes.expirationTime`?
> >>>>> I configured SA_TIME_ONE_DAY.
> >>>>>
> >>>>> Steps to run the application:
> >>>>>
> >>>>> ================================================================
> >>>>>
> >>>>> Compile :
> >>>>>
> >>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
> >>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
> >>>>>
> >>>>>
> >>>>> Run :
> >>>>>
> >>>>> 1) saCkptCheckpointOpen() on node A
> >>>>>
> >>>>> NODE-A# ./checkpoint_A
> >>>>>
> >>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
> >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> >>>>> saCkptSectionCreate Press <Enter> key to continue...
> >>>>>
> >>>>> 2) saCkptCheckpointOpen() of the same ckpt on node B
> >>>>>
> >>>>> NODE-B# ./checkpoint_B
> >>>>>
> >>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
> >>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
> >>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter> key to continue...
> >>>>>
> >>>>>
> >>>>> 3) saCkptSectionCreate() on node A and read saCkptCheckpointStatusGet()
> >>>>>
> >>>>> NODE-A#
> >>>>> checkpointStatus.numberOfSections : 10000
> >>>>> checkpointStatus.memoryUsed :756000
> >>>>> checkpointCreationAttributes.creationFlags;10
> >>>>> checkpointCreationAttributes.checkpointSize;10240000
> >>>>> checkpointCreationAttributes.retentionDuration;60000000000
> >>>>> checkpointCreationAttributes.maxSections;10000
> >>>>> checkpointCreationAttributes.maxSectionSize;1024
> >>>>> checkpointCreationAttributes.maxSectionIdSize;64
> >>>>> ================================
> >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press <Enter> key to continue...
> >>>>> saCkptCheckpoint Press <Enter> key to continue...
> >>>>>
> >>>>>
> >>>>> 4) saCkptActiveReplicaSet() on node B and read saCkptCheckpointStatusGet()
> >>>>>
> >>>>> NODE-B#
> >>>>> checkpointStatus.numberOfSections : 10000
> >>>>> checkpointStatus.memoryUsed :756000
> >>>>> checkpointCreationAttributes.creationFlags;10
> >>>>> checkpointCreationAttributes.checkpointSize;10240000
> >>>>> checkpointCreationAttributes.retentionDuration;60000000000
> >>>>> checkpointCreationAttributes.maxSections;10000
> >>>>> checkpointCreationAttributes.maxSectionSize;1024
> >>>>> checkpointCreationAttributes.maxSectionIdSize;64
> >>>>>
> >>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press <Enter> key to continue...
> >>>>> saCkptCheckpoint Press <Enter> key to continue..
> >>>>>
> >>>>> ================================================================
> >>>>>
> >>>>> -AVM
> >>>>>
> >>>>>
> >>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
> >>>>>> Hi Alex,
> >>>>>>
> >>>>>> We have never tested 7500 sections; we will test and let you know.
> >>>>>> Can you please share your test application? That will allow us
> >>>>>> to respond quickly.
> >>>>>>
> >>>>>> -AVM
> >>>>>>
> >>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
> >>>>>>> Hello All,
> >>>>>>>
> >>>>>>> I'm experimenting with the checkpoint service, and some things
> >>>>>>> don't appear to work.
> >>>>>>>
> >>>>>>> The saCkptActiveReplicaSet and
> >>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
> >>>>>>> checkpoint has section numbers greater than around 5500.
> >>>>>>>
> >>>>>>> I've created a checkpoint with 7500 sections, each section being
> >>>>>>> 1024 bytes. The checkpoint is co-located and the "active replica"
> >>>>>>> bit is set.
> >>>>>>>
> >>>>>>> I can create and write all the sections. And from another node
> >>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
> >>>>>>> Everything is there. I see no errors from any CKPT API calls.
> >>>>>>>
> >>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
> >>>>>>> other node. After I do this, saCkptCheckpointStatusGet now returns
> >>>>>>> all the same information except the number of sections is no longer
> >>>>>>> 7500 but 0. If I do this test with 50,000 sections only about 3,000
> >>>>>>> entries get synced. And iterating through the sections shows that
> >>>>>>> there are only 3,000 sections.
> >>>>>>>
> >>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation has
> >>>>>>> no effect, either.
> >>>>>>>
> >>>>>>> After looking through the code I see a comment in
> >>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
> >>>>>>> with old active if now this fellow is becoming active. */" So, it
> >>>>>>> doesn't appear that syncing is being done in the
> >>>>>>> saCkptActiveReplicaSet, which it should be.
> >>>>>>>
> >>>>>>> Can someone comment?
> >>>>>>>
> >>>>>>> I'm going to fix this and post a patch unless someone else is
> >>>>>>> already working on it, but I didn't see a bug for it.
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Opensaf-devel mailing list
> >>>>>>> Opensaf-devel@lists.sourceforge.net
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel