Re: Connect timeout not always accurate

marko kiiskila Mon, 07 Nov 2016 08:46:19 -0800

> On Nov 7, 2016, at 8:00 AM, Christopher Collins <[email protected]> wrote:
> 
> Hello all,
> 
> Will found a bug in the host, which is probably (hopefully!) the only
> bug here.  I plan on testing and committing a fix for this bug later
> today.  In the meantime, here is the bug:
> 
>    /*  (ble_hs_heartbeat_sched(), ble_hs.c) */
> 
>    /* Reset heartbeat timer if it is not currently scheduled or if the
>     * specified time is sooner than the current expiration time.
>     */
>    if (!os_callout_queued(&ble_hs_heartbeat_timer.cf_c) ||
>        OS_TIME_TICK_LT(ticks_from_now, ble_hs_heartbeat_timer.cf_c.c_ticks)) {
>        //              ^^^^            ^^^^
> 
>        ble_hs_heartbeat_timer_reset(ticks_from_now);
>    }
> 
> The mistake is marked with a // comment above.  A relative time
> (ticks_from_now) is being compared to an absolute time (cf_c.c_ticks),
> which of course yields nonsense.  I will need to double-check, but I
> think this issue is not caught by the any unit tests because each test
> starts at time 0, eliminating the distinction between absolute and
> relative time.  It would be good to advance the OS time by some amount
> prior to running timer-dependent tests.  I will make sure I can get the
> appropriate tests to fail before I fix the bug.
>


That is a good idea. I’d prefer starting with random value. Or by
a timestamp that’s slightly smaller than 0. So you’ll some testing of
the wraparound.

> Thanks Will and Simon!
> 
> Chris
> 
> On Sun, Nov 06, 2016 at 08:44:40PM -0800, Christopher Collins wrote:
>> Hi Simon,
> 
> 
>> 
>> On Sun, Nov 06, 2016 at 01:50:42PM -0800, Simon Ratner wrote:
>>> Hi devs,
>>> 
>>> I am seeing some variance in the handling of `duration_ms` param to
>>> `ble_gap_connect` (on develop). Below are two log traces that demonstrate
>>> it. In both cases, duration_ms=1280; the first shows the timeout event
>>> (status=13) after 535 ticks (4280 ms), while the second timeout is at a
>>> much more respectable 169 ticks (1352 ms).
>>> 
>>> ```
>>> 23352163:[ts=2050146956ssb, mod=4 level=1] GAP procedure initiated:
>>> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
>>> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
>>> min_ce_len=16 max_ce_len=768
>>> 23352693:[ts=2054287572ssb, mod=4 level=1] GAP procedure initiated:
>>> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
>>> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
>>> adv_data_len=28
>>> 23352698:[ts=2054326632ssb, mod=64 level=2] gatt_cli: connection failed;
>>> status=13
>>> ```
>>> 
>>> ```
>>> 23355205:[ts=2073912596ssb, mod=4 level=1] GAP procedure initiated:
>>> connect; peer_addr_type=1 peer_addr=46:e9:00:75:ce:af scan_itvl=16
>>> scan_window=16 itvl_min=16 itvl_max=32 latency=0 supervision_timeout=256
>>> min_ce_len=16 max_ce_len=768
>>> 23355333:[ts=2074912596ssb, mod=4 level=1] GAP procedure initiated:
>>> advertise; disc_mode=2 peer_addr_type=0 peer_addr=none adv_channel_map=0
>>> own_addr_type=0 adv_filter_policy=0 adv_itvl_min=338 adv_itvl_max=338
>>> adv_data_len=28
>>> 23355374:[ts=2075232888ssb, mod=64 level=2] gatt_cli: connection failed;
>>> status=13
>>> ```
>> 
>> Hmm, I see what you mean.  I think this is probably a bug in the host,
>> though I don't see any obvious issues in the code.  I can think of a few
>> other possibilities, but they are proably not all that likely.  I want
>> to mention them here just so we can rule them out.
>> 
>> 1. Host event processing code is being starved due to tx or rx of too
>> many BLE data packets.
>> 
>> 2. Host parent task doesn't get a chance to run for an extended period
>> of time because higher priority tasks are continuously busy.
>> 
>> Option 1 concerns data packets only, not events from the controller,
>> such as advertising reports.  I think this could only be the case if
>> your device is sending or receiving notifications at a very high rate
>> (e.g., faster than 10 ms).
>> 
>> Option 2 depends on how tasks in the application are prioritized.  Does
>> your application create a task that could have gotten stuck in a busy
>> loop or similar for 30 seconds?
>> 
>> If neither of these sound likely, then it is probably a bug in Mynewt
>> that I will need to replicate.  I noticed from the log that you are
>> connecting and advertising at the same time, which I will try.  Is there
>> anything else you can say about what your application might have been
>> doing when the problem occurred?
>> 
>> Thanks,
>> Chris

Re: Connect timeout not always accurate

Reply via email to