Re: [lwip-users] Out of memory in PCP_PCB pool after 2^32 milliseconds

Trampas Stern Fri, 28 May 2021 13:24:04 -0700

As for the ChibiOS time issues, I have a simple rule:

*On my embedded systems, every line of code I put into the project becomes
my problem!*

That is, if I use lwIP and it has a bug, customers do not care whether the
bug is in lwIP or not; it is my problem to fix.  Hence every line of code
becomes my problem.  As such, I try not to use code I do not understand.
Often (lwIP is an example) you have to use libraries, but do so knowing that
their problems become yours.  Yes, lwIP has bitten me more than once where
it did not work the way I thought it should or would.  That was my fault and
my problem to fix.

I often go to extremes and will not use the processor vendor's drivers
until I have done a code review and understand them.  I have been bitten
more than once where a vendor's drivers were just "example code."  One
vendor told me that their code should never be used in production.  Another
vendor had drivers full of bugs and corner cases where they would fail, but
insisted their code was production ready.  I have seen vendor drivers
violate the datasheet.  So a detailed code review of *all* code is required.

Hence, if you use ChibiOS and it has a bug, well, it is now your bug to fix.
That is, every line of code in ChibiOS is now your problem.

Trampas

On Fri, May 28, 2021 at 4:05 PM Trampas Stern <[email protected]> wrote:

> A trick I use in my code and libraries is to use typedefs for
> variables.
>
> #include <stdint.h>
> typedef uint32_t milliseconds_t;
> milliseconds_t getMillis(void);
>
> Then I use milliseconds_t to define all time variables.  This allows me
> to change it to uint64_t in one location, depending on the project.
>
> I have started using more typedefs like this as a form of documentation.
> That is, code is easier to read and follow when variables are defined
> based on their use/type.
>
> A neat unsigned integer math trick applies when doing comparisons...
>
> milliseconds_t start = getMillis();
>
> // This is bad
> while (getMillis() < (start + 10)) {  // wait for 10 ms
> ....
> }
>
> To understand why, assume milliseconds_t is uint8_t (and that start + 10
> is kept in a uint8_t as well).  Say we read start and it is 255.  Then
> (start + 10) wraps to 9, while getMillis() on the first pass through the
> loop is still 255.  The comparison becomes while (255 < 9), so you exit
> the while loop early.
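
To make the early exit concrete, here is a minimal sketch (not from the original message) of the "bad" deadline test with an 8-bit counter. The helper name deadline_not_reached() is hypothetical. Note one assumption: the cast back to milliseconds_t matters for types narrower than int, because C promotes start + wait to int (265 here) and the sum would otherwise not wrap.

```c
#include <stdint.h>

/* Deliberately tiny type so the wrap is easy to see. */
typedef uint8_t milliseconds_t;

/* Hypothetical helper: the "bad" test, with the deadline stored in the
 * counter's own type so that it wraps modulo 256. */
static int deadline_not_reached(milliseconds_t now,
                                milliseconds_t start,
                                milliseconds_t wait)
{
    milliseconds_t deadline = (milliseconds_t)(start + wait);
    return now < deadline;
}
```

With start = 100 the test behaves as intended (it is true for now = 100..109 and false at 110). With start = 255 the deadline wraps to 9, so the test is already false on the first pass and the wait ends immediately.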
>
> A better way to do this is:
> milliseconds_t start = getMillis();
>
> // This is good
> while ((getMillis() - start) < 10) {  // wait for 10 ms
> ....
> }
>
> Here, if start and getMillis() are both 255, the first pass is
> while (0 < 10).  On the next millisecond getMillis() has wrapped to 0, so
> (getMillis() - start) = (0 - 255) = 1.  To understand this, look at the
> math in binary:
>     0000 0000
>   - 1111 1111
>   = 1 0000 0001
> where the leading 1 is the borrow (the "negative" bit), but since we are
> 8-bit unsigned the value is 1.  This means that unsigned subtraction gives
> you the elapsed time modulo 2^8, which is correct as long as the real
> elapsed time is shorter than one full counter period.
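
The same idea as a small self-contained sketch (not from the original message; elapsed_ms() and still_waiting() are hypothetical names). It assumes an 8-bit counter purely for illustration. The explicit cast keeps the result in the counter's type; without it, C would promote a uint8_t subtraction to int and 0 - 255 would be -255 rather than 1. With a uint32_t counter on a 32-bit target the cast is harmless.

```c
#include <stdint.h>

/* Deliberately tiny type so the wrap is easy to see. */
typedef uint8_t milliseconds_t;

/* Hypothetical helper: elapsed time modulo 2^8. Valid only while the real
 * elapsed time is shorter than one full counter period (256 ms here). */
static milliseconds_t elapsed_ms(milliseconds_t now, milliseconds_t start)
{
    return (milliseconds_t)(now - start);
}

/* Hypothetical helper: the "good" loop condition from the message. */
static int still_waiting(milliseconds_t now,
                         milliseconds_t start,
                         milliseconds_t wait)
{
    return elapsed_ms(now, start) < wait;
}
```

Starting at 255, the elapsed time reads 0 at now = 255, 1 at now = 0, and 10 at now = 9, so a 10 ms wait ends on time even though the counter wrapped in the middle.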
>
> With that said, the code works, but other developers might not
> understand it, and you risk them adding or modifying code in a way that
> breaks things.  Therefore I often just use uint64_t to make sure other
> developers do not break the code.  If speed becomes an issue I can
> optimize the code to use the unsigned math tricks, but only as a last
> resort.
>
> Note that I know many developers who refuse to use unsigned variables
> because of math issues like the one above, so they try to use signed
> integers for almost everything.  You still have overflow issues, but you
> do not have these math issues.
>
> Here is a blog article I wrote on embedded systems and time:
> https://bitvolatile.com/?p=303
>
> Trampas
>
> On Fri, May 28, 2021 at 3:25 PM Adam Baron <[email protected]> wrote:
>
>> Hello Trampas,
>>
>> Thanks for the hints. I initialized the sys ticks to 120 seconds before
>> the 2^32 wrap, and I got the mqtt pbuf=NULL after around 120 seconds
>> plus the 120 keep-alive seconds.
>>
>> The ChibiOS sys_arch.c port implements sys_now() (current time in
>> milliseconds) as follows, in simplified form:
>>   return ((u32_t)chVTGetSystemTimeX() - 1) / 10 + 1;
>> since it ticks every 100 µs.
>>
>> I guess this might cause the problems: the result wraps back to near 0,
>> leaving the lwIP timers waiting for a value higher than (2^32)/10.
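
For reference, 2^32 ticks of 100 µs is about 429,497 seconds, or roughly 4.97 days, which matches a failure on the fifth day. One possible way around the early wrap, as a sketch only: accumulate tick deltas so the millisecond value runs through the full 32-bit range. This is not the ChibiOS port's code; ms_from_ticks() and its state variables are hypothetical, 10 ticks per millisecond is assumed from the message, the rounding differs slightly from the quoted expression, and the function is not thread-safe and must be called at least once per tick-counter wrap.

```c
#include <stdint.h>

#define TICKS_PER_MS 10u      /* assumed: 100 us tick, as stated above */

static uint32_t last_ticks;   /* tick counter value seen on the previous call */
static uint32_t leftover;     /* ticks not yet converted, always < TICKS_PER_MS */
static uint32_t now_ms;       /* running millisecond count, wraps at 2^32 */

/* Hypothetical helper: the caller passes the raw 32-bit tick counter. */
static uint32_t ms_from_ticks(uint32_t ticks)
{
    /* Unsigned subtraction is modulo 2^32, so the delta stays correct
     * when the tick counter wraps. */
    uint32_t delta = ticks - last_ticks;
    last_ticks = ticks;

    now_ms += delta / TICKS_PER_MS;
    leftover += delta % TICKS_PER_MS;
    now_ms += leftover / TICKS_PER_MS;
    leftover %= TICKS_PER_MS;
    return now_ms;
}
```

In a port, sys_now() would pass in the value read from the system tick counter. The millisecond count then keeps increasing across the tick wrap instead of jumping back.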
>>
>> To support my guess, I turned on another debug option, and the last lwIP
>> timer message I see is:
>> sys_timeout: 2000C5DC abs_time=429497730 handler=ip_reass_tmr arg=805B28C
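
A quick check of the arithmetic behind that log line, as a sketch under the assumption that the simplified expression quoted above is the whole computation (quoted_sys_now() and timer_is_reachable() are hypothetical names). The largest value the expression can return is 0xFFFFFFFF / 10 + 1 = 429496730, while the logged abs_time is 429497730, exactly 1000 ms beyond it.

```c
#include <stdint.h>

/* Same arithmetic as the simplified sys_now() quoted in the message,
 * with the raw tick count passed in instead of read from the OS. */
static uint32_t quoted_sys_now(uint32_t ticks)
{
    return (ticks - 1u) / 10u + 1u;
}

/* Hypothetical helper: can the expression above ever return abs_time? */
static int timer_is_reachable(uint32_t abs_time)
{
    uint32_t max_ms = UINT32_MAX / 10u + 1u;   /* 429496730 */
    return abs_time >= 1u && abs_time <= max_ms;
}
```

So sys_now() wraps before it can ever reach the value that timer is waiting for, which is consistent with the guess that the timers stop firing after the wrap.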
>>
>>
>> Adam
>>
>> On Fri, May 28, 2021 at 13:45, Trampas Stern <[email protected]>
>> wrote:
>>
>>> Increase the counter to a uint64_t.
>>>
>>> You can also start the counter at something other than zero to prove
>>> root cause faster.
>>>
>>> Trampas
>>>
>>> On Fri, May 28, 2021 at 7:08 AM Adam Baron <[email protected]> wrote:
>>>
>>>> Hi Tomek :),
>>>>
>>>> I'll try to add it. Thanks.
>>>>
>>>> However, I feel it is more likely related to a uint32 counter of some
>>>> kind overflowing, since the TCP_PCBs are no longer freed after 2^32
>>>> ticks.
>>>>
>>>> Adam
>>>>
>>>> On Fri, May 28, 2021 at 9:44, Tomasz W <[email protected]> wrote:
>>>>
>>>>> Hi (Cześć),
>>>>> Take a look at this:
>>>>> https://lists.nongnu.org/archive/html/lwip-devel/2020-12/msg00014.html
>>>>> In my case it solved the problem of the web server dying after a few
>>>>> days.
>>>>>
>>>>>
>>>>> On Fri, May 28, 2021 at 08:58, Adam Baron <[email protected]> wrote:
>>>>> >
>>>>> > Hello all,
>>>>> >
>>>>> > I have a small STM32F4 application running on the devel branch of
>>>>> > lwIP. It includes httpd, sntp, an smtp client, and an mqtt client.
>>>>> > Everything runs well until the fifth day, when the mqtt client starts
>>>>> > to receive pbuf=NULL and disconnects. My reconnect routine reconnects
>>>>> > it after a short time, but it receives pbuf=NULL again shortly after.
>>>>> >
>>>>> > Later on I also noticed this in the log: memp_malloc: out of memory
>>>>> > in pool TCP_PCB.
>>>>> > I have MEMP_NUM_TCP_PCB defined as 30, which seems enough for normal
>>>>> > operation. I also raised it to 50, but ended up with the same problem.
>>>>> > In the statistics, NUM_TCP_PCB increases and decreases as it should,
>>>>> > but once uptime passes 5 days it stays high with an error flag set.
>>>>> >
>>>>> > Interestingly, it happens exactly after 2^32 milliseconds of uptime.
>>>>> > I tried to keep OpenOCD connected so I could peek in, but so far I
>>>>> > have not managed to keep OpenOCD running that long without it
>>>>> > dropping the connection.
>>>>> >
>>>>> > Does anyone have any ideas please?
>>>>> >
>>>>> > Thanks in advance,
>>>>> > --
>>>>> > 731435556
>>>>> > Adam Baron
>>>>> > _______________________________________________
>>>>> > lwip-users mailing list
>>>>> > [email protected]
>>>>> > https://lists.nongnu.org/mailman/listinfo/lwip-users
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Tomek
>>>>
>>>>
>>>>
>>>> --
>>>> 731435556
>>>> Adam Baron
>>>
>>
>>
>>
>> --
>> 731435556
>> Adam Baron
>
>

_______________________________________________
lwip-users mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/lwip-users
