hujun260 commented on PR #17468:
URL: https://github.com/apache/nuttx/pull/17468#issuecomment-3677382071
> > @anchao @acassis **Patch Motivation & Proper Solution:** The main purpose of this patch is to address use-after-free issues that can occur when a thread exits unexpectedly. For example, in `proc_open()`, `nxsched_get_tcb()` may obtain the TCB just before the corresponding thread exits at a higher priority; a similar risk exists during calls to functions such as `nxclock_gettime()`, `sched_backtrace()`, or `mm_memdump()`. These problems should be addressed systematically, by introducing critical sections or by employing other synchronization strategies in those critical paths.
> > **Not Just SMP:** These use-after-free problems can occur in both SMP
and UP environments — this is not an SMP-only issue.
> > **Limitations of Critical Sections:** Even with the addition of critical
sections, these issues may not be fully resolved. For instance, consider the
following example:
> > ```
> > mm_memdump()
> > {
> >   enter_critical_section();
> >   tcb = nxsched_get_tcb(pid);
> >   syslog();                 // might wake up or schedule another thread
> >   name = get_task_name(tcb);
> >   leave_critical_section();
> > }
> > ```
> >
> > In this scenario, even with enter_critical_section(), there can still be a race: if syslog() internally leads to a context switch, the thread that owns the TCB can exit, and the subsequent get_task_name(tcb) then dereferences freed memory.
> > **On Mainstream RTOS Practice:** The claim that "mainstream RTOS do not incorporate this feature, as the number of threads is predetermined in most scenarios" is not accurate. In a typical IoT system, threads (pthreads) are created and destroyed frequently and widely; thread counts are rarely static. While thread counts might be fixed in safety-critical automotive contexts, such scenarios are the exception, not the rule, for consumer and general-purpose products.
> > **On Performance and Size Impact:** Statements such as "it will exert an
extremely detrimental impact on both size and performance" are not
substantiated without concrete measurement or testing. According to the data
provided by @hujun260, the actual performance impact is minimal; please
carefully review the test report. Regarding code size: yes, adding additional
safety checks will increase flash usage, and we will supplement the PR with
precise flash size measurements.
> > If there is a better alternative to solve this problem, then this patch
is unnecessary. Patches are welcome.
>
> You can try running this code. The `nxsched_get_tcb()` function incurs a 2.5~4% performance overhead, and this function is used throughout NuttX:
>
> ```
> #include <nuttx/config.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <time.h>
> #include <pthread.h>
>
> /****************************************************************************
>  * Public Functions
>  ****************************************************************************/
>
> /****************************************************************************
>  * hello_main
>  ****************************************************************************/
>
> pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
>
> static void timespec_diff(const struct timespec *start,
>                           const struct timespec *end,
>                           struct timespec *diff)
> {
>   diff->tv_sec  = end->tv_sec - start->tv_sec;
>   diff->tv_nsec = end->tv_nsec - start->tv_nsec;
>
>   if (diff->tv_nsec < 0)
>     {
>       diff->tv_sec--;
>       diff->tv_nsec += 1000000000;
>     }
> }
>
> static void timespec_add(struct timespec *total, const struct timespec *diff)
> {
>   total->tv_sec  += diff->tv_sec;
>   total->tv_nsec += diff->tv_nsec;
>
>   if (total->tv_nsec >= 1000000000)
>     {
>       total->tv_sec += total->tv_nsec / 1000000000;
>       total->tv_nsec = total->tv_nsec % 1000000000;
>     }
> }
>
> static void timespec_avg(const struct timespec *total, int count,
>                          struct timespec *avg)
> {
>   uint64_t total_ns = (uint64_t)total->tv_sec * 1000000000 + total->tv_nsec;
>   uint64_t avg_ns   = total_ns / count;
>
>   avg->tv_sec  = avg_ns / 1000000000;
>   avg->tv_nsec = avg_ns % 1000000000;
> }
>
> int main(int argc, char *argv[])
> {
>   struct timespec start;
>   struct timespec end;
>   struct timespec diff;
>   struct timespec total = {0, 0};
>   struct timespec avg;
>   int i, j = 0;
>   const int loop_count = 10;
>
>   while (j < loop_count)
>     {
>       i = 0;
>       j++;
>
>       clock_gettime(CLOCK_BOOTTIME, &start);
>       pthread_mutex_lock(&g_mutex);
>       while (i < 1000 * 1000)
>         {
>           i++;
>           pthread_mutex_trylock(&g_mutex);
>         }
>       pthread_mutex_unlock(&g_mutex);
>
>       clock_gettime(CLOCK_BOOTTIME, &end);
>
>       timespec_diff(&start, &end, &diff);
>       timespec_add(&total, &diff);
>       timespec_avg(&total, j, &avg);
>
>       printf("%d: diff = %lu.%09lu s | avg = %lu.%09lu s\n",
>              j,
>              (unsigned long)diff.tv_sec, (unsigned long)diff.tv_nsec,
>              (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
>     }
>
>   printf("\n===== result =====\n");
>   printf("count: %d\n", loop_count);
>   printf("total: %lu.%09lu s\n",
>          (unsigned long)total.tv_sec, (unsigned long)total.tv_nsec);
>   printf("avg: %lu.%09lu s\n",
>          (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
>
>   return 0;
> }
> ```
>
> Additionally, I conducted the validation on the sim host PC; running this
test on an MCU may result in even greater performance degradation:
>
> Before this change:
>
> ```
> nsh> hello
> 1: diff = 0.306540108 s | avg = 0.306540108 s
> 2: diff = 0.297250552 s | avg = 0.301895330 s
> 3: diff = 0.296764389 s | avg = 0.300185016 s
> 4: diff = 0.297345113 s | avg = 0.299475040 s
> 5: diff = 0.296678248 s | avg = 0.298915682 s
> 6: diff = 0.296436913 s | avg = 0.298502553 s
> 7: diff = 0.296414429 s | avg = 0.298204250 s
> 8: diff = 0.296148430 s | avg = 0.297947272 s
> 9: diff = 0.296353964 s | avg = 0.297770238 s
> 10: diff = 0.296554913 s | avg = 0.297648705 s
>
> ===== result =====
> count: 10
> total: 2.976487059 s
> avg: 0.297648705 s
> ```
>
> After this change:
>
> ```
> nsh> hello
> 1: diff = 0.322894500 s | avg = 0.322894500 s
> 2: diff = 0.304714785 s | avg = 0.313804642 s
> 3: diff = 0.303939787 s | avg = 0.310516357 s
> 4: diff = 0.304068904 s | avg = 0.308904494 s
> 5: diff = 0.304900920 s | avg = 0.308103779 s
> 6: diff = 0.304780429 s | avg = 0.307549887 s
> 7: diff = 0.304122773 s | avg = 0.307060299 s
> 8: diff = 0.304266724 s | avg = 0.306711102 s
> 9: diff = 0.304566429 s | avg = 0.306472805 s
> 10: diff = 0.303873630 s | avg = 0.306212888 s
>
> ===== result =====
> count: 10
> total: 3.062128881 s
> avg: 0.306212888 s
> ```
I've added the `nxsched_get_tcb_noref` interface, whose performance is the same as before, and I've also reimplemented `nxsched_verify_pid`.
Furthermore, I retested the scenario you mentioned, and there is no performance degradation at all.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]