Hi!
I'm working on InfiniTime, an open source firmware for the PineTime.
PineTime is a FOSS smartwatch based on the nRF52832 MCU.
InfiniTime uses NimBLE 1.3 (tag nimble_1_3_0_tag on GitHub) and
FreeRTOS.
A few months ago, I reached out to this mailing list for help fixing BLE
connection issues, and it has worked mostly fine since then.
Mostly... except when sending new firmware through the OTA procedure:
sometimes the transfer just stops, for no apparent reason. When this
happens, we have to reset the whole MCU because the BLE stack looks
completely frozen: no advertising, no connection, ...
I've finally decided to debug this issue using a BLE sniffer (based on
the nRF52-DK), btmon connected to the RTT output of NimBLE, and a logic
analyzer.
While the transfer is running, the BLE sniffer shows a 'write' packet from
the phone and an empty packet from the watch. When the transfer fails,
it looks like the watch no longer sends an empty PDU between the packets
sent by the host.
btmon shows that everything just stops, without any error. For example:
ACL Data RX: Handle 1 flags 0x02 dlen 27
#238433
18446744073709548371.836900
ATT: Write Command (0x52) len 22
Handle: 0x0044
Data: 3df807200fb030bd9df80430042b01d02046f2e7
ACL Data RX: Handle 1 flags 0x02 dlen 27
#238434
18446744073709548371.837500
ATT: Write Command (0x52) len 22
Handle: 0x0044
Data: dde902231568516892689a6020461d605960e8e7
* Drops: cmd 0 evt 0 acl_tx 1 acl_rx 0 sco_tx 0 sco_rx 0 other 0
ACL Data RX: Handle 1 flags 0x02 dlen 27
#238435
18446744073709548371.883500
ATT: Write Command (0x52) len 22
Handle: 0x0044
Data: 2af05cfa104a9df810301468adf80e5068f30003
ACL Data RX: Handle 1 flags 0x02 dlen 27
#238436
18446744073709548371.884100
ATT: Write Command (0x52) len 22
Handle: 0x0044
Data: 0c22adf80c7002968df810308df8042034b1d4e9
The last entry above (#238436) is the final packet logged before the
transfer failed.
Using the logic analyzer connected to debug pins that I set/clear at
specific places in the code, I observed that, when the error occurs, the
ll_task looks frozen: it does not handle any events anymore.
In the attached screenshot, you'll find a capture from the logic
analyzer:
- The first channel (D0) shows the activity of the LL task (1 =
processing event).
- The second channel (D3) is set while waiting for an event from the
queue (xQueueReceive()).
- The third channel (D2) is set in npl_freertos_eventq_put().
- The last channel (D4) is set in npl_freertos_eventq_remove().
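
To make the instrumentation concrete, here is a minimal sketch of the
idea. It is a host-side model, not the actual InfiniTime or NimBLE code:
the pin numbers follow the channel list above, but debug_pin_set(),
debug_pin_clear(), fake_gpio_out, and the instrumented_*() functions are
hypothetical names, and a plain counter stands in for the FreeRTOS queue.

```c
#include <stdint.h>

/* Pin numbers follow the channel list above (assumed mapping). */
enum {
    PIN_LL_TASK  = 0, /* D0: LL task is processing an event      */
    PIN_EVQ_PUT  = 2, /* D2: inside npl_freertos_eventq_put()    */
    PIN_EVQ_WAIT = 3, /* D3: waiting in xQueueReceive()          */
    PIN_EVQ_RM   = 4  /* D4: inside npl_freertos_eventq_remove() */
};

/* Hypothetical stand-in for the GPIO output register. On the real
 * target these helpers would write to the nRF52 GPIO peripheral. */
static volatile uint32_t fake_gpio_out;

static void debug_pin_set(unsigned pin)     { fake_gpio_out |=  (1u << pin); }
static void debug_pin_clear(unsigned pin)   { fake_gpio_out &= ~(1u << pin); }
static int  debug_pin_is_high(unsigned pin) { return (int)((fake_gpio_out >> pin) & 1u); }

/* A counter stands in for the event queue in this model. */
static int pending_events;

/* "put": the pin stays high for the whole duration of the call. */
static void instrumented_put(void)
{
    debug_pin_set(PIN_EVQ_PUT);
    pending_events++;                 /* real code: queue insertion */
    debug_pin_clear(PIN_EVQ_PUT);
}

/* "remove": same pattern on its own pin. */
static void instrumented_remove(void)
{
    debug_pin_set(PIN_EVQ_RM);
    if (pending_events > 0) {
        pending_events--;             /* real code: queue removal */
    }
    debug_pin_clear(PIN_EVQ_RM);
}

/* Task loop body: D3 is high while waiting, D0 while processing.
 * Returns 1 if an event was handled, 0 if the queue was empty. */
static int instrumented_get_and_process(void)
{
    debug_pin_set(PIN_EVQ_WAIT);
    int got = pending_events > 0;     /* real code: xQueueReceive() */
    if (got) {
        pending_events--;
    }
    debug_pin_clear(PIN_EVQ_WAIT);

    if (!got) {
        return 0;
    }
    debug_pin_set(PIN_LL_TASK);
    /* the event callback would run here */
    debug_pin_clear(PIN_LL_TASK);
    return 1;
}
```

With this kind of instrumentation, the width of each pulse on the
analyzer is the time spent inside the corresponding call, which is how a
longer-than-usual 'put' becomes visible in the capture.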
Most of the time, I can see short bursts of 9 events (like the one on the
left of the picture). When it fails, there are fewer than 9 events, and
you can see that the last 'put' took longer to run than the previous ones.
It looks like a deadlock occurs somewhere. Is it in my integration of
NimBLE? In the FreeRTOS port of NimBLE? Or maybe a bug in NimBLE or
FreeRTOS itself?
I don't really know where to look next. Any suggestions on how to debug
this issue?
Thanks
JF