Hi Guys,

     After doing some more testing I'm still seeing some problems.

     The patch worked fine for a 2N model, but our real requirements are 
a little different.

     Here's the setup.  5+1 redundancy.  5 active blades and 1 standby 
blade protecting all the other blades.  I am creating a checkpoint on 
each active blade, and the standby is opening all 5 checkpoints to do 
the backup.

     40k sections on each checkpoint, and 1k of data in each section.

     Every so often I am still seeing MDS problems, but they are 
different now with the patch.

Jan 10 16:20:35.776939 <3680821261> ERR    |MDS_SND_RCV: Timeout or Error occured
Jan 10 16:20:35.777031 <3680821261> ERR    |MDS_SND_RCV: Timeout occured on sndrsp message
Jan 10 16:20:35.777062 <3680821261> ERR    |MDS_SND_RCV: Adest=<0x0002040f,3798024214>

Jan 10 16:20:50.098279 <3680821261> ERR    |LEN-MISMATCH:recvd_on_sock=8034, size_in_mds_hdr=65034, TIPC-ID=0x010010056ab7600b, ADEST=<0002050f,1790402571>
Jan 10 16:20:50.098326 <3680821261> ERR    |DUMP:Changing dump-extent:buff=0x998fa300:max=100, len=8034
Jan 10 16:20:50.098348 <3680821261> ERR    |DUMP:buff=0x998fa300:offset=  0 to   7:Bytes = 0xfe 0x0a 0x00 0x00 : 0x0f 0x1e 0x80 0x01

Alex

On 01/09/2014 04:43 AM, A V Mahesh wrote:
> Hi Alex,
>
> Use the patch below as a workaround so that you can proceed with your
> testing. This patch just increases the MDS internal fragmentation size
> to roughly TIPC_MAX_USER_MSG_SIZE, which is defined in tipc.h.
>
> I will work with Hans on a final patch that considers both the
> TIPC & TCP transports, and the testing involved, as part of ticket
> `#654 MDS improvements`
> (https://sourceforge.net/p/opensaf/tickets/654/ ).
>
> I tested this patch with 10K sections on the TIPC transport; the
> checkpoint memory used was 10136000.
>
> ==================================================================================
>  
>
> diff --git a/osaf/libs/core/mds/include/mds_dt.h b/osaf/libs/core/mds/include/mds_dt.h
> --- a/osaf/libs/core/mds/include/mds_dt.h
> +++ b/osaf/libs/core/mds/include/mds_dt.h
> @@ -32,6 +32,7 @@
>  #include "ncs_main_papi.h"
>  #include "ncssysf_mem.h"
>  #include "ncspatricia.h"
> +#include <linux/tipc.h>
>
>
>  /* This file is private to the MDTM layer. */
> @@ -109,7 +110,7 @@ typedef struct mdtm_reassembly_queue {
>
>  #define MDTM_MAX_DIRECT_BUFF_SIZE  MDTM_MAX_SEGMENT_SIZE
>
> -#define MDTM_NORMAL_MSG_FRAG_SIZE   1400
> +#define MDTM_NORMAL_MSG_FRAG_SIZE  (TIPC_MAX_USER_MSG_SIZE-1000) /* TIPC_MAX_USER_MSG_SIZE = 66000 define <linux/tipc.h> */
>
>  #define MDTM_RECV_BUFFER_SIZE ((MDS_DIRECT_BUF_MAXSIZE>MDTM_NORMAL_MSG_FRAG_SIZE)? \
> (MDS_DIRECT_BUF_MAXSIZE+SUM_MDS_HDR_PLUS_MDTM_HDR_PLUS_LEN):(MDTM_NORMAL_MSG_FRAG_SIZE+SUM_MDS_HDR_PLUS_MDTM_HDR_PLUS_LEN))
>  
>
> ==================================================================================
>  
>
>
> -AVM
>
>
> On 1/8/2014 10:42 PM, Alex Jones wrote:
>> Hi Hans,
>>
>>     Changing rmem_default and rmem_max has no effect on the problem.  
>> I even tried up to 2M to no avail.
>>
>>     However, after looking at the cpnd_transfer_replica function in 
>> cpnd_evt.c, I found the following in cpsv_evt.h which controls how 
>> large the packets are which are sent through MDS:
>>
>> #define MAX_SYNC_TRANSFER_SIZE           (30 * 1024 * 1024)
>>
>>     30M?  What is the rationale for this number?  This seems way too 
>> high.  When I change it to (4*1024*1024) (4M) it solves my problem, 
>> and doesn't appear to affect performance.
>>
>> Alex
>>
>> On 01/08/2014 08:30 AM, Hans Feldt wrote:
>>> sysctl -a | grep rmem
>>>
>>> set rmem_default to 256K or so
>>>
>>> /Hans
>>>
>>>> -----Original Message-----
>>>> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
>>>> Sent: den 8 januari 2014 14:01
>>>> To: A V Mahesh; Alex Jones
>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>> Subject: Re: [devel] checkpoint problems
>>>>
>>>> The socket receive buffer size used is the system default. It can 
>>>> be too small; pump it up.
>>>> I plan to make some changes in MDS for this (and other stuff).
>>>> /Hans
>>>>
>>>>> -----Original Message-----
>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>> Sent: den 8 januari 2014 11:29
>>>>> To: Alex Jones
>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>> Subject: Re: [devel] checkpoint problems
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> I suggest you increase the following TIPC value (in the TIPC
>>>>> code) and rebuild `tipc.ko`:
>>>>>
>>>>> net/tipc/tipc_socket.c:#define OVERLOAD_LIMIT_BASE 5000
>>>>>
>>>>> You can increase it to 50000 and try again.
>>>>>
>>>>> - AVM.
>>>>>
>>>>> On 1/8/2014 4:16 AM, Alex Jones wrote:
>>>>>> After doing some deep debugging I am seeing the following in the MDS
>>>>>> log on node B.  This is when the CPND_EVT_ND2ND_CKPT_ACTIVE_SYNC is
>>>>>> sent from the active replica on node A to the replica on node B.
>>>>>> The sync message never gets up to the CPND layer on node B because
>>>>>> it is dropped.
>>>>>>
>>>>>> This is with 10k sections, each section 1k.
>>>>>>
>>>>>> Jan  7 21:32:32.772347 <1789648919> ERR    |MDTM: Frag recd is not next frag so dropping adest=<0x010010023922604c>
>>>>>> Jan  7 21:32:32.772399 <1789648919> ERR    |MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=<0x010010023922604c>
>>>>>>
>>>>>> I've turned on MDS debug on node B, and the packet being sent over
>>>>>> is gigantic.  It starts failing at fragment number 2703. The next
>>>>>> fragment that comes in is 2707, then 2722.  The last fragment that
>>>>>> comes in is 7444.
>>>>>>
>>>>>> I've done a cursory look at the hardware stats, and nothing is being
>>>>>> rate-limited or dropped.
>>>>>>
>>>>>> I'm going to take a deeper look at this, but I'm mentioning it in
>>>>>> case it rings any bells.  I am using TIPC as the transport.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On 01/07/2014 07:24 AM, Alex Jones wrote:
>>>>>>> AVM,
>>>>>>>
>>>>>>>      I get SA_AIS_ERR_TIMEOUT even when I pass SA_TIME_END as the
>>>>>>> timeout value.  Is this not a bug?  The synchronous CheckpointOpen
>>>>>>> call doesn't work at all in this scenario.  It never succeeds.
>>>>>>>
>>>>>>>      I can reproduce the problem with
>>>>>>> sectionCreationAttributes.expirationTime set to SA_TIME_ONE_DAY.
>>>>>>>
>>>>>>>      You should be able to reproduce the problem with the code I
>>>>>>> sent in the last e-mail.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On 01/06/2014 10:31 PM, A V Mahesh wrote:
>>>>>>>> Hi Alex,
>>>>>>>>
>>>>>>>> The CheckpointOpen call failing with SA_AIS_ERR_TIMEOUT is NOT a bug;
>>>>>>>> it is expected if you pass too small a timeout value
>>>>>>>> (`timeout = 1000000000`) to the saCkptCheckpointOpen(....,timeout ...)
>>>>>>>> call when the ckpt has very large data/sections. Just increasing the
>>>>>>>> timeout will avoid the SA_AIS_ERR_TIMEOUT.
>>>>>>>>
>>>>>>>> Let us focus on your original issue/scenario: are you able to
>>>>>>>> reproduce the problem with sectionCreationAttributes.expirationTime
>>>>>>>> set to SA_TIME_ONE_DAY?
>>>>>>>>
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>> On 1/7/2014 1:17 AM, Alex Jones wrote:
>>>>>>>>> AVM,
>>>>>>>>>
>>>>>>>>>      I've been playing around with your test program, and have
>>>>>>>>> gotten it to fail.
>>>>>>>>>
>>>>>>>>>      I made the following changes:
>>>>>>>>>
>>>>>>>>>   1. Change init_dataX to be 1024k bytes, so that you are
>>>>>>>>>      initializing the section to be 1024k.
>>>>>>>>>   2. Also, don't start the program on node B until A has finished
>>>>>>>>>      writing/creating all the sections.
>>>>>>>>>   3. Before hitting the enter key on node B, wait for the
>>>>>>>>>      OpenAsync call to finish.
>>>>>>>>>
>>>>>>>>>      You might notice the CheckpointOpen call failing now with
>>>>>>>>> SA_AIS_ERR_TIMEOUT.  I had to turn this into OpenAsync, and add a
>>>>>>>>> thread to process CkptDispatch messages.  This uncovers another
>>>>>>>>> bug in OpenAsync.  I've attached the mods to your program here.
>>>>>>>>>
>>>>>>>>>     The OpenAsync callback will be called twice, both times with
>>>>>>>>> error == SA_AIS_ERR_TIMEOUT.  If I call OpenAsync again when I get
>>>>>>>>> this error, the next callback returns success, but the callback
>>>>>>>>> gets called twice with success and with two different checkpoint
>>>>>>>>> handles!
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 01/06/2014 06:18 AM, A V Mahesh wrote:
>>>>>>>>>> Hi Alex,
>>>>>>>>>>
>>>>>>>>>> I have created 10K sections (please find the attached test
>>>>>>>>>> applications `Alex_test_node_A_app.c` & `Alex_test_node_B_app.c`)
>>>>>>>>>> with your specified scenario & configuration, and I haven't
>>>>>>>>>> observed any issue with the sections on another node.
>>>>>>>>>>
>>>>>>>>>> Try to reproduce the problem on your setup & let me know the
>>>>>>>>>> result.
>>>>>>>>>>
>>>>>>>>>> One more important point: what value did you configure for
>>>>>>>>>> `sectionCreationAttributes.expirationTime`?
>>>>>>>>>> I configured SA_TIME_ONE_DAY.
>>>>>>>>>>
>>>>>>>>>> Steps to run the application:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ======================================================================================================
>>>>>>>>>> Compile :
>>>>>>>>>>
>>>>>>>>>> NODE-A# gcc Alex_test_node_A_app.c -o checkpoint_A -lSaCkpt
>>>>>>>>>> NODE-B# gcc Alex_test_node_B_app.c -o checkpoint_B -lSaCkpt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Run :
>>>>>>>>>>
>>>>>>>>>> 1) saCkptCheckpointOpen On node A
>>>>>>>>>>
>>>>>>>>>> NODE-A# ./checkpoint_A
>>>>>>>>>>
>>>>>>>>>> CPSV:CPA:ONsaCkptSectionCreate  Waiting to Create Sections
>>>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>>>> saCkptSectionCreate Press <Enter> key to continue...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2) saCkptCheckpointOpen() on the same ckpt on node B
>>>>>>>>>>
>>>>>>>>>> NODE-B# ./checkpoint_B
>>>>>>>>>>
>>>>>>>>>> CPSV:CPA:ONsaCkptSectionIterationInitialize Waiting to read Sections
>>>>>>>>>> safCkpt=test_checkpoint_name1,safApp=safCkptService....
>>>>>>>>>> saCkptActiveReplicaSet saCkptSectionIterationInitialize Press <Enter> key to continue...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3) saCkptSectionCreate() on node A and read saCkptCheckpointStatusGet()
>>>>>>>>>>
>>>>>>>>>> NODE-A#
>>>>>>>>>>     checkpointStatus.numberOfSections : 10000
>>>>>>>>>>     checkpointStatus.memoryUsed :756000
>>>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>>>     ================================
>>>>>>>>>> saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press <Enter> key to continue...
>>>>>>>>>> saCkptCheckpoint Press <Enter> key to continue...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 4) saCkptActiveReplicaSet() on node B and read saCkptCheckpointStatusGet()
>>>>>>>>>>
>>>>>>>>>> NODE-B#
>>>>>>>>>>     checkpointStatus.numberOfSections : 10000
>>>>>>>>>>     checkpointStatus.memoryUsed :756000
>>>>>>>>>> checkpointCreationAttributes.creationFlags;10
>>>>>>>>>> checkpointCreationAttributes.checkpointSize;10240000
>>>>>>>>>> checkpointCreationAttributes.retentionDuration;60000000000
>>>>>>>>>> checkpointCreationAttributes.maxSections;10000
>>>>>>>>>> checkpointCreationAttributes.maxSectionSize;1024
>>>>>>>>>> checkpointCreationAttributes.maxSectionIdSize;64
>>>>>>>>>>
>>>>>>>>>>     saCkptCheckpointUnlink / saCkptCheckpointClose / saCkptFinalize Press <Enter> key to continue...
>>>>>>>>>>     saCkptCheckpoint Press <Enter> key to continue..
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ======================================================================================================
>>>>>>>>>> -AVM
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 1/6/2014 12:32 PM, A V Mahesh wrote:
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> We have never tested with 7500 sections; we will test and let
>>>>>>>>>>> you know. Can you please share your test application? That
>>>>>>>>>>> will allow us to respond quickly.
>>>>>>>>>>>
>>>>>>>>>>> -AVM
>>>>>>>>>>>
>>>>>>>>>>> On 1/3/2014 8:23 PM, Alex Jones wrote:
>>>>>>>>>>>> Hello All,
>>>>>>>>>>>>
>>>>>>>>>>>>        I'm experimenting with the checkpoint service, and some
>>>>>>>>>>>> things don't appear to work.
>>>>>>>>>>>>
>>>>>>>>>>>>        The saCkptActiveReplicaSet and
>>>>>>>>>>>> saCkptCheckpointSynchronize[Async] don't appear to work when the
>>>>>>>>>>>> checkpoint has section numbers greater than around 5500.
>>>>>>>>>>>>
>>>>>>>>>>>>        I've created a checkpoint with 7500 sections, each section
>>>>>>>>>>>> being 1024 bytes.  The checkpoint is co-located and the "active
>>>>>>>>>>>> replica" bit is set.
>>>>>>>>>>>>
>>>>>>>>>>>>        I can create and write all the sections.  And from another
>>>>>>>>>>>> node I run saCkptCheckpointStatusGet, and the information all
>>>>>>>>>>>> looks good.  Everything is there.  I see no errors from any CKPT
>>>>>>>>>>>> API calls.
>>>>>>>>>>>>
>>>>>>>>>>>>        The problem comes when I call saCkptActiveReplicaSet from
>>>>>>>>>>>> this other node.  After I do this, saCkptCheckpointStatusGet now
>>>>>>>>>>>> returns all the same information except the number of sections
>>>>>>>>>>>> is no longer 7500 but 0.  If I do this test with 50,000 sections
>>>>>>>>>>>> only about 3,000 entries get synced.  And iterating through the
>>>>>>>>>>>> sections shows that there are only 3,000 sections.
>>>>>>>>>>>>
>>>>>>>>>>>>        Calling saCkptCheckpointSynchronize[Async] in this
>>>>>>>>>>>> situation has no effect, either.
>>>>>>>>>>>>
>>>>>>>>>>>>        After looking through the code I see a comment in
>>>>>>>>>>>> cpnd_evt_proc_ckpt_arep_set that says "/* ###TBD sync up is
>>>>>>>>>>>> missing with old active if now this fellow is becoming active. */"
>>>>>>>>>>>> So, it doesn't appear that syncing is being done in the
>>>>>>>>>>>> saCkptActiveReplicaSet, which it should be.
>>>>>>>>>>>>
>>>>>>>>>>>>        Can someone comment?
>>>>>>>>>>>>
>>>>>>>>>>>>        I'm going to fix this and post a patch unless someone
>>>>>>>>>>>> else is already working on it, but I didn't see a bug for it.
>>>>>>>>>>>>
>>>>>>>>>>>> Alex
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>>>>>  
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Rapidly troubleshoot problems before they affect your 
>>>>>>>>>>>> business. Most IT
>>>>>>>>>>>> organizations don't have a clear picture of how application 
>>>>>>>>>>>> performance
>>>>>>>>>>>> affects their revenue. With AppDynamics, you get 100% 
>>>>>>>>>>>> visibility into
>>>>>>>>>>>> your
>>>>>>>>>>>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of
>>>>>>>>>>>> AppDynamics Pro!
>>>>>>>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>>>>>>>>>>>>  
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> Opensaf-devel mailing list
>>>>>>>>>>>> Opensaf-devel@lists.sourceforge.net
>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
>>
>>
>
>


