Otpvondoiats opened a new pull request, #18102:
URL: https://github.com/apache/nuttx/pull/18102

   ## Summary
   
   This PR fixes critical deadlock and synchronization issues in the NuttX 
sensor subsystem (`drivers/sensors`), addressing three distinct concurrency 
problems that could cause system hangs and incorrect behavior in multi-core 
sensor operations.
   
   ### Problem 1: Deadlock in set_interval and batch Operations (Commit: 
298c7f49df1)
   
   **Root Cause:**  
   When calling `lower->ops->set_interval()` or `lower->ops->batch()` while 
holding `upper->lock`, a deadlock could occur if the lower-half driver needs to 
access sensor state or trigger callbacks that also require the lock.
   
   **Call Stack Leading to Deadlock:**
   ```
   sensor_ioctl(SNIOC_SET_INTERVAL)
     └─> nxrmutex_lock(&upper->lock)
         └─> lower->ops->set_interval()
             └─> [Driver may callback or access upper state]
                 └─> nxrmutex_lock(&upper->lock)  ← DEADLOCK
   ```
   
   **Solution:**
   - Remove lock before calling `lower->ops->set_interval()` and 
`lower->ops->batch()`
   - Implement race condition detection by saving original 
`min_interval`/`min_latency` values
   - Use `goto again` to retry if values were modified by concurrent threads
   - Move `sensor_update_interval()` and `sensor_update_latency()` calls 
outside the lock in `sensor_close()`
   
   ### Problem 2: Deadlock in RPMSG Flush Operation (Commit: f147d9418ea)
   
   **Root Cause:**  
   The flush operation held `upper->lock` while sending RPMSG messages and 
waiting for responses, causing deadlock when the RPMSG handler thread needs the 
same lock to process the message.
   
   **Deadlock Scenario:**
   ```
   Thread A:                           RPMSG Thread:
   lock upper_lock
     lock upper_lock (nested)
     rpmsg_send()        ────────>     [waiting for upper_lock]
     unlock upper_lock
     wait_response()     <───X───      [cannot respond - blocked on lock]
   unlock upper_lock
   ```
   
   **Solution:**
   - Remove `nxrmutex_lock/unlock(&upper->lock)` from `SNIOC_FLUSH` case in 
`sensor_ioctl()`
   - Allow RPMSG handler to acquire lock and process flush requests 
independently
   - Validation check (subscriber count) still performed, but without holding 
lock during RPMSG communication
   
   ### Problem 3: Cross-Core Flush Failure Handling (Commit: af6840d53b4)
   
   **Root Cause:**  
   When a physical sensor driver doesn't implement the `flush` operation, 
cross-core flush requests would fail silently or return incorrect status codes, 
causing confusion in distributed sensor systems.
   
   **Issues:**
   1. Remote cores couldn't distinguish between "flush not supported" and 
"flush failed"
   2. Local flush completion events weren't generated for unsupported operations
   3. Inconsistent return values between local and remote callers
   
   **Solution:**
   - Initialize `ret = -ENOTSUP` before checking flush support
   - Generate `SENSOR_EVENT_FLUSH_COMPLETE` immediately when flush is not 
supported
   - Differentiate return values:
     - Local calls (`SENSOR_REMOTE == 0`): Return `0` (success, flush completed 
immediately)
     - Remote calls: Return `-ENOTSUP` (inform remote core that flush is 
unsupported)
   - Add logic in `sensor_rpmsg_flush()` to return `-ENOTSUP` when driver has 
no flush interface and caller is remote or device is advertiser
   
   ### Problem 4: Global Lock Contention in RPMSG Layer (Commit: bc2d950909a)
   
   **Root Cause:**  
   Global device and endpoint lists (`g_devlist`, `g_eptlist`) were protected 
by recursive mutexes (`rmutex_t`), causing unnecessary contention when multiple 
threads read the lists simultaneously.
   
   **Performance Issue:**
   - Readers blocked each other unnecessarily
   - Advertisement/subscription broadcasts serialized across all cores
   - Endpoint lookups during message routing became bottlenecks
   
   **Solution:**
   - Replace `rmutex_t` with `rw_semaphore_t` (read-write locks)
   - Use `down_read()/up_read()` for read-only operations:
     - Device lookups (`sensor_rpmsg_find_dev`)
     - Advertisement broadcasts (`sensor_rpmsg_advsub`)
     - Endpoint iteration during bound events
   - Use `down_write()/up_write()` for modifications:
     - Adding devices (`sensor_rpmsg_register`)
     - Removing devices (`sensor_rpmsg_unregister`)
     - Adding/removing endpoints (bind/release)
   - Improves scalability in multi-core sensor systems
   
   ## Impact
   
   ### Users
   - **Positive Impact:** Eliminates system hangs and deadlocks in 
sensor-intensive applications
   - **Positive Impact:** Improved responsiveness for sensor interval/batch 
adjustments
   - **Positive Impact:** Better performance in multi-core systems with 
concurrent sensor operations
   - **Behavior Change:** Flush operations on unsupported sensors now complete 
immediately with success (local) or ENOTSUP (remote)
   - **No API Changes:** All modifications are internal implementation fixes
   
   ### Build Process
   - No build system changes required
   - No new configuration options introduced
   - Existing sensor drivers continue to work without modification
   
   # Testing
   
   ~~~
   ap> uorb_listener -f sensor_accel,sensor_gyro,sensor_mag
   Result:
           Topic [sensor_mag0] flush: SUCCESS.
           Topic [sensor_gyro0] flush: SUCCESS.
           Topic [sensor_accel0] flush: SUCCESS.
   Total number of flush topics: 3
   ap> uorb_listener -n 10 sensor_accel
   
   Mointor objects num:1
   object_name:sensor_accel, object_instance:0
   
sensor_accel(now:1243623900):timestamp:956893288,x:0.394620,y:0.297217,z:9.634545,temperature:29.992188
   
sensor_accel(now:1243665900):timestamp:956934988,x:0.420956,y:0.227785,z:9.648911,temperature:29.992188
   
sensor_accel(now:1243706100):timestamp:956975188,x:0.315611,y:0.280458,z:9.850024,temperature:29.992188
   
sensor_accel(now:1243746300):timestamp:957015388,x:0.385043,y:0.359466,z:9.744678,temperature:29.992188
   
sensor_accel(now:1243786500):timestamp:957055688,x:0.361101,y:0.285246,z:9.775804,temperature:29.992188
   
sensor_accel(now:1243826700):timestamp:957095888,x:0.356313,y:0.316371,z:9.833264,temperature:29.992188
   
sensor_accel(now:1243866800):timestamp:957136088,x:0.468840,y:0.234968,z:9.787775,temperature:29.992188
   
sensor_accel(now:1243907000):timestamp:957176288,x:0.310823,y:0.297217,z:9.850024,temperature:29.992188
   
sensor_accel(now:1243947300):timestamp:957216588,x:0.351524,y:0.321159,z:9.814111,temperature:29.992188
   
sensor_accel(now:1243987200):timestamp:957256688,x:0.387437,y:0.290035,z:9.797352,temperature:29.992188
   Object name:sensor_accel0, recieved:10
   Total number of received Message:10/10
   ap> uorb_listener -i sensor_accel
   Topic [sensor_accel0] info:
           name:BMI270 Accelerometer
           vendor:Bosch
           version:3
           power:0.010000
           max_range:78.480003
           resolution:0.002400
           min_delay:20000
           max_delay:40000
           fifo_reserved_event_count:40
           fifo_max_event_count:40
   
   ap>
   ~~~
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to