Hi Alex,

I suggest you increase the following TIPC value (in the TIPC code) and rebuild `tipc.ko`:

net/tipc/tipc_socket.c: #define OVERLOAD_LIMIT_BASE 5000

You can increase it to 50000 and try again.

- AVM

On 1/8/2014 4:16 AM, Alex Jones wrote:
> After doing some deep debugging I am seeing the following in the MDS
> log on node B. This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
> sent from the active replica on node A to the replica on node B. The
> sync message never gets up to the CPND layer on node B because it is
> dropped.
>
> This is with 10k sections, each section 1k.
>
> Jan 7 21:32:32.772347 <1789648919> ERR |MDTM: Frag recd is not next frag so dropping adest=<0x010010023922604c>
> Jan 7 21:32:32.772399 <1789648919> ERR |MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>
> I've turned on MDS debug on node B, and the packet being sent over is
> gigantic. It starts failing at fragment number 2703. The next
> fragment that comes in is 2707, then 2722. The last fragment that
> comes in is 7444.
>
> I've done a cursory look at the hardware stats, and nothing is being
> rate-limited or dropped.
>
> I'm going to take a deeper look at this, but I'm mentioning it in case
> it rings any bells. I am using TIPC as the transport.
>
> Alex
>
> On 01/07/2014 07:24 AM, Alex Jones wrote:
>> AVM,
>>
>> I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
>> timeout value. Is this not a bug? The synchronous CheckpointOpen
>> call doesn't work at all in this scenario. It never succeeds.
>>
>> I can reproduce the problem with
>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>
>> You should be able to reproduce the problem with the code I sent
>> in the last e-mail.
>>
>> Alex
>>
>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>> Hi Alex,
>>>
>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug.
>>> It is expected if you pass a small timeout value (`timeout = 1000000000`)
>>> to the saCkptCheckpointOpen(....,timeout ...) call when the ckpt has
>>> very large data/sections.
>>> Just increasing the timeout will avoid the SA_AIS_ERR_TIMEOUT.
>>>
>>> Let us focus on your original issue/scenario: are you able to
>>> reproduce the problem with sectionCreationAttributes.expirationTime
>>> set to SA_TIME_ONE_DAY?
>>>
>>> -AVM
>>>
>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>> AVM,
>>>>
>>>> I've been playing around with your test program, and have
>>>> gotten it to fail.
>>>>
>>>> I made the following changes:
>>>>
>>>> 1. Change init_dataX to be 1024k bytes, so that you are
>>>>    initializing the section to be 1024k.
>>>> 2. Also, don't start the program on node B until A has finished
>>>>    writing/creating all the sections.
>>>> 3. Before hitting the enter key on node B, wait for the OpenAsync
>>>>    call to finish.
>>>>
>>>> You might notice the CheckpointOpen call failing now with
>>>> SA_AIS_ERR_TIMEOUT. I had to turn this into OpenAsync, and add a
>>>> thread to process CkptDispatch messages. This uncovers another bug
>>>> in OpenAsync. I've attached the mods to your program here.
>>>>
>>>> The OpenAsync callback will be called twice, both times with
>>>> error == SA_AIS_ERR_TIMEOUT. If I call OpenAsync again when I get
>>>> this error, the next callback returns success, but the callback
>>>> gets called twice with success and with two different checkpoint
>>>> handles!
>>>>
>>>> Alex
>>>>
>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>> Hi Alex,
>>>>>
>>>>> I have created 10K sections (please find the attached test
>>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>>> with your specified scenario & configuration, and I haven't observed
>>>>> any issue with sections on another node.
>>>>>
>>>>> Try to reproduce the problem on your setup & let me know the result.
>>>>>
>>>>> One more important point: what value did you configure for
>>>>> `sectionCreationAttributes.expirationTime`?
>>>>> I configured SA_TIME_ONE_DAY.
>>>>>
>>>>> Steps to run the application:
>>>>>
>>>>> ===================================================================================================================
>>>>>
>>>>> Compile:
>>>>>
>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>> NODE-A# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>
>>>>> Run:
>>>>>
>>>>> 1) saCkptCheckpointOpen() on node A
>>>>>
>>>>> NODE-A# ./checkpoint_A
>>>>>
>>>>> CPSV:CPA:ONsaCkptSectionCreate Waiting to Create Sections
>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>
>>>>> 2) saCkptCheckpointOpen() of the same ckpt on node B
>>>>>
>>>>> NODE-B# ./checkpoint_B
>>>>>
>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter>
>>>>> key to continue...
>>>>>
>>>>> 3) saCkptSectionCreate() on node A and read saCkptCheckpointStatusGet()
>>>>>
>>>>> NODE-A#
>>>>> checkpointStatus.numberOfSections : 10000
>>>>> checkpointStatus.memoryUsed :756000
>>>>> checkpointCreationAttributes.creationFlags;10
>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>> checkpointCreationAttributes.maxSections;10000
>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>> ================================
>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>> <Enter> key to continue...
>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>
>>>>> 4) saCkptActiveReplicaSet() on node B and saCkptCheckpointStatusGet()
>>>>>
>>>>> NODE-B#
>>>>> checkpointStatus.numberOfSections : 10000
>>>>> checkpointStatus.memoryUsed :756000
>>>>> checkpointCreationAttributes.creationFlags;10
>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>> checkpointCreationAttributes.maxSections;10000
>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>
>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press
>>>>> <Enter> key to continue...
>>>>> saCkptCheckpoint Press <Enter> key to continue..
>>>>>
>>>>> ================================================================================================================================
>>>>>
>>>>> -AVM
>>>>>
>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> We never tested 7500 sections. We will test and let you know.
>>>>>> Can you please share your test application? That would allow us
>>>>>> to respond quickly.
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>> Hello All,
>>>>>>>
>>>>>>> I'm experimenting with the checkpoint service, and some things
>>>>>>> don't appear to work.
>>>>>>>
>>>>>>> The saCkptActiveReplicaSet and
>>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
>>>>>>> checkpoint has more than around 5500 sections.
>>>>>>>
>>>>>>> I've created a checkpoint with 7500 sections, each section being
>>>>>>> 1024 bytes. The checkpoint is co-located and the "active replica"
>>>>>>> bit is set.
>>>>>>>
>>>>>>> I can create and write all the sections. And from another node
>>>>>>> I run saCkptCheckpointStatusGet, and the information all looks good.
>>>>>>> Everything is there. I see no errors from any CKPT API calls.
>>>>>>>
>>>>>>> The problem comes when I call saCkptActiveReplicaSet from this
>>>>>>> other node.
>>>>>>> After I do this, saCkptCheckpointStatusGet now returns
>>>>>>> all the same information except the number of sections is no longer
>>>>>>> 7500 but 0. If I do this test with 50,000 sections only about 3,000
>>>>>>> entries get synced. And iterating through the sections shows that
>>>>>>> there are only 3,000 sections.
>>>>>>>
>>>>>>> Calling saCkptCheckpointSynchronize[Async] in this situation has
>>>>>>> no effect, either.
>>>>>>>
>>>>>>> After looking through the code I see a comment in
>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is missing
>>>>>>> with old active if now this fellow is becoming active. */" So, it
>>>>>>> doesn't appear that syncing is being done in
>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>
>>>>>>> Can someone comment?
>>>>>>>
>>>>>>> I'm going to fix this and post a patch unless someone else is
>>>>>>> already working on it, but I didn't see a bug for it.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Opensaf-devel mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
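
For reference, the two MDTM errors quoted in this thread ("Frag recd is not next frag so dropping" and "Message is dropped as msg is out of seq") would be consistent with a receiver that accepts fragments only in strict order and abandons the whole message as soon as one fragment is missing. Below is a minimal sketch of that behaviour in C. It is not the actual MDS/MDTM code: the type `reassembly_t` and the functions `reassembly_init` and `reassembly_feed` are hypothetical names made up for illustration, and the strict in-order policy is an assumption inferred from the log text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receiver-side state for one fragmented message.
 * Illustration only; this is NOT how MDS/MDTM is actually implemented. */
typedef struct {
    uint32_t next_expected; /* fragment number expected next */
    uint32_t accepted;      /* number of fragments accepted so far */
    bool     dropped;       /* true once a gap was seen and the message was abandoned */
} reassembly_t;

/* Start tracking a new message whose first fragment number is first_frag. */
void reassembly_init(reassembly_t *r, uint32_t first_frag)
{
    assert(r != NULL);
    r->next_expected = first_frag;
    r->accepted = 0;
    r->dropped = false;
}

/* Feed one received fragment number.
 * Returns true if the fragment is the expected one and is accepted.
 * Returns false if it is out of sequence, or if the message was already
 * abandoned; in both cases the whole message counts as dropped. */
bool reassembly_feed(reassembly_t *r, uint32_t frag_no)
{
    assert(r != NULL);
    if (r->dropped || frag_no != r->next_expected) {
        r->dropped = true;
        return false;
    }
    r->next_expected++;
    r->accepted++;
    return true;
}
```

With such a policy, feeding fragments 1 through 2702 and then 2707 (assuming, for illustration, that 2703 was the first fragment lost) makes `reassembly_feed` return false, and every later fragment is rejected as well. A single fragment lost in the transport would then be enough to keep the whole sync message from reaching CPND, which is why raising the TIPC limit is worth trying.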
