Otpvondoiats opened a new pull request, #18102:
URL: https://github.com/apache/nuttx/pull/18102
## Summary
This PR fixes critical deadlock and synchronization issues in the NuttX
sensor subsystem (`drivers/sensors`), addressing three distinct concurrency
problems that could cause system hangs and incorrect behavior in multi-core
sensor operations.
### Problem 1: Deadlock in set_interval and batch Operations (Commit:
298c7f49df1)
**Root Cause:**
When calling `lower->ops->set_interval()` or `lower->ops->batch()` while
holding `upper->lock`, a deadlock could occur if the lower-half driver needs to
access sensor state or trigger callbacks that also require the lock.
**Call Stack Leading to Deadlock:**
```
sensor_ioctl(SNIOC_SET_INTERVAL)
└─> nxrmutex_lock(&upper->lock)
└─> lower->ops->set_interval()
└─> [Driver may callback or access upper state]
└─> nxrmutex_lock(&upper->lock) ← DEADLOCK
```
**Solution:**
- Remove lock before calling `lower->ops->set_interval()` and
`lower->ops->batch()`
- Implement race condition detection by saving original
`min_interval`/`min_latency` values
- Use `goto again` to retry if values were modified by concurrent threads
- Move `sensor_update_interval()` and `sensor_update_latency()` calls
outside the lock in `sensor_close()`
### Problem 2: Deadlock in RPMSG Flush Operation (Commit: f147d9418ea)
**Root Cause:**
The flush operation held `upper->lock` while sending RPMSG messages and
waiting for responses, causing deadlock when the RPMSG handler thread needs the
same lock to process the message.
**Deadlock Scenario:**
```
Thread A: RPMSG Thread:
lock upper_lock
lock upper_lock (nested)
rpmsg_send() ────────> [waiting for upper_lock]
unlock upper_lock
wait_response() <───X─── [cannot respond - blocked on lock]
unlock upper_lock
```
**Solution:**
- Remove `nxrmutex_lock/unlock(&upper->lock)` from `SNIOC_FLUSH` case in
`sensor_ioctl()`
- Allow RPMSG handler to acquire lock and process flush requests
independently
- Validation check (subscriber count) still performed, but without holding
lock during RPMSG communication
### Problem 3: Cross-Core Flush Failure Handling (Commit: af6840d53b4)
**Root Cause:**
When a physical sensor driver doesn't implement the `flush` operation,
cross-core flush requests would fail silently or return incorrect status codes,
causing confusion in distributed sensor systems.
**Issues:**
1. Remote cores couldn't distinguish between "flush not supported" and
"flush failed"
2. Local flush completion events weren't generated for unsupported operations
3. Inconsistent return values between local and remote callers
**Solution:**
- Initialize `ret = -ENOTSUP` before checking flush support
- Generate `SENSOR_EVENT_FLUSH_COMPLETE` immediately when flush is not
supported
- Differentiate return values:
- Local calls (`SENSOR_REMOTE == 0`): Return `0` (success, flush completed
immediately)
- Remote calls: Return `-ENOTSUP` (inform remote core that flush is
unsupported)
- Add logic in `sensor_rpmsg_flush()` to return `-ENOTSUP` when driver has
no flush interface and caller is remote or device is advertiser
### Problem 4: Global Lock Contention in RPMSG Layer (Commit: bc2d950909a)
**Root Cause:**
Global device and endpoint lists (`g_devlist`, `g_eptlist`) were protected
by recursive mutexes (`rmutex_t`), causing unnecessary contention when multiple
threads read the lists simultaneously.
**Performance Issue:**
- Readers blocked each other unnecessarily
- Advertisement/subscription broadcasts serialized across all cores
- Endpoint lookups during message routing became bottlenecks
**Solution:**
- Replace `rmutex_t` with `rw_semaphore_t` (read-write locks)
- Use `down_read()/up_read()` for read-only operations:
- Device lookups (`sensor_rpmsg_find_dev`)
- Advertisement broadcasts (`sensor_rpmsg_advsub`)
- Endpoint iteration during bound events
- Use `down_write()/up_write()` for modifications:
- Adding devices (`sensor_rpmsg_register`)
- Removing devices (`sensor_rpmsg_unregister`)
- Adding/removing endpoints (bind/release)
- Improves scalability in multi-core sensor systems
## Impact
### Users
- **Positive Impact:** Eliminates system hangs and deadlocks in
sensor-intensive applications
- **Positive Impact:** Improved responsiveness for sensor interval/batch
adjustments
- **Positive Impact:** Better performance in multi-core systems with
concurrent sensor operations
- **Behavior Change:** Flush operations on unsupported sensors now complete
immediately with success (local) or ENOTSUP (remote)
- **No API Changes:** All modifications are internal implementation fixes
### Build Process
- No build system changes required
- No new configuration options introduced
- Existing sensor drivers continue to work without modification
# Testing
~~~
ap> uorb_listener -f sensor_accel,sensor_gyro,sensor_mag
Result:
Topic [sensor_mag0] flush: SUCCESS.
Topic [sensor_gyro0] flush: SUCCESS.
Topic [sensor_accel0] flush: SUCCESS.
Total number of flush topics: 3
ap> uorb_listener -n 10 sensor_accel
Mointor objects num:1
object_name:sensor_accel, object_instance:0
sensor_accel(now:1243623900):timestamp:956893288,x:0.394620,y:0.297217,z:9.634545,temperature:29.992188
sensor_accel(now:1243665900):timestamp:956934988,x:0.420956,y:0.227785,z:9.648911,temperature:29.992188
sensor_accel(now:1243706100):timestamp:956975188,x:0.315611,y:0.280458,z:9.850024,temperature:29.992188
sensor_accel(now:1243746300):timestamp:957015388,x:0.385043,y:0.359466,z:9.744678,temperature:29.992188
sensor_accel(now:1243786500):timestamp:957055688,x:0.361101,y:0.285246,z:9.775804,temperature:29.992188
sensor_accel(now:1243826700):timestamp:957095888,x:0.356313,y:0.316371,z:9.833264,temperature:29.992188
sensor_accel(now:1243866800):timestamp:957136088,x:0.468840,y:0.234968,z:9.787775,temperature:29.992188
sensor_accel(now:1243907000):timestamp:957176288,x:0.310823,y:0.297217,z:9.850024,temperature:29.992188
sensor_accel(now:1243947300):timestamp:957216588,x:0.351524,y:0.321159,z:9.814111,temperature:29.992188
sensor_accel(now:1243987200):timestamp:957256688,x:0.387437,y:0.290035,z:9.797352,temperature:29.992188
Object name:sensor_accel0, recieved:10
Total number of received Message:10/10
ap> uorb_listener -i sensor_accel
Topic [sensor_accel0] info:
name:BMI270 Accelerometer
vendor:Bosch
version:3
power:0.010000
max_range:78.480003
resolution:0.002400
min_delay:20000
max_delay:40000
fifo_reserved_event_count:40
fifo_max_event_count:40
ap>
~~~
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]