Fix-Point opened a new pull request, #17312:
URL: https://github.com/apache/nuttx/pull/17312

   ## Summary
   
   Part I of the PR https://github.com/apache/nuttx/pull/17276. This part 
simplifies the current timer drivers in preparation for ClockDevice. 
   
   This PR proposes ClockDevice, a new timer driver abstraction for NuttX. The 
new CLOCKDEVICE timer hardware abstraction delivers:
   - Functional correctness: Thread-safe and overflow-free
   - High performance: Up to 3x faster on ARMv7A platforms
   - Theoretically optimal timing precision: Uses hardware cycle counts as the 
time unit.
   - Time-unit independent interfaces: Decouples timer drivers from OS time 
subsystems
   - Minimalist driver implementation: Reduces driver code size by nearly 70%.
   
   ## Situation
   
   Let’s look at the current state of the timing subsystem in NuttX.
   <img width="556" height="276" alt="image" 
src="https://github.com/user-attachments/assets/214b74e1-6fe7-4c64-8b58-3dab0043557a"
 />
   
   The NuttX timing subsystem consists of four layers:
   - **Hardware Timer Drivers**: Includes implementations of various hardware 
timer drivers.
   - **Timer Driver Abstraction**: Such as `Oneshot` and `Timer`, which provide 
timer hardware abstraction.
   - **OS Timer Interfaces**: `Timer(up_timer_*)` and `Alarm(up_alarm_*)`, 
offering relative and absolute timer interfaces.
   - **OS Timer Abstraction**: The `wdog` module manages software timers and 
provides a unified timer API to upper layers.
   
   Here we focus on the oneshot timer driver abstraction.
   
   So what are the current problems?
   
   First, there are problems related to time conversion, primarily overflow and 
loss of precision.
   
   **Overflow**:
   NuttX performs conversions between timespec and tick values, which can lead 
to overflow.
   For example, converting a timespec into ticks could cause multiplication to 
overflow, resulting in incorrect values.
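
   The overflow can be sketched as follows. This is an illustration only: it 
assumes a 32-bit tick counter and a hypothetical tick length of 100 us, and the 
helper names `sec_to_ticks_naive` and `sec_to_ticks_checked` are made up for 
this example rather than taken from NuttX.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC  1000000000u
#define NSEC_PER_TICK 100000u                          /* assumed: 100 us per tick */
#define TICKS_PER_SEC (NSEC_PER_SEC / NSEC_PER_TICK)   /* 10000 */

/* Naive conversion: the 32-bit product wraps silently for large inputs. */
static uint32_t sec_to_ticks_naive(uint32_t sec)
{
  return (uint32_t)(sec * TICKS_PER_SEC);
}

/* Widened conversion: compute in 64 bits and report overflow to the caller.
 * Returns 0 on success, -1 if the result does not fit in 32 bits.
 */
static int sec_to_ticks_checked(uint32_t sec, uint32_t nsec, uint32_t *ticks)
{
  uint64_t t;

  assert(ticks != 0 && nsec < NSEC_PER_SEC);

  t = (uint64_t)sec * TICKS_PER_SEC + nsec / NSEC_PER_TICK;
  if (t > UINT32_MAX)
    {
      return -1;
    }

  *ticks = (uint32_t)t;
  return 0;
}
```

   With these assumed values, `sec_to_ticks_naive(500000)` (about 5.8 days) 
wraps to 705032704 instead of 5000000000, while `sec_to_ticks_checked` returns 
-1 for the same input.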
   
   **Loss of Precision**:
   The current ARM Generic Timer uses `cycle_per_tick` to simplify time 
calculations. However, this approach only works accurately if the frequency is 
divisible by `USEC_PER_TICK`.
   If not, significant precision loss can occur. For instance, on `Cortex-R82`, 
the timer frequency might be `2.73MHz`, a value that often doesn’t divide 
cleanly, leading to timing inaccuracies.
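
   A small sketch of how a truncated cycles-per-tick value drifts. The numbers 
are assumptions chosen for illustration (a hypothetical 2,730,667 Hz counter 
and a 1000 Hz tick), and the function names are not NuttX APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Tick count derived from a precomputed, truncated cycles-per-tick value. */
static uint64_t ticks_truncated(uint64_t cycles, uint32_t freq_hz,
                                uint32_t ticks_per_sec)
{
  uint32_t cycles_per_tick;

  assert(ticks_per_sec != 0 && freq_hz >= ticks_per_sec);

  cycles_per_tick = freq_hz / ticks_per_sec;   /* remainder is dropped here */
  return cycles / cycles_per_tick;
}

/* Reference conversion: multiply first, then divide once.
 * The product can itself overflow for very large cycle counts.
 */
static uint64_t ticks_exact(uint64_t cycles, uint32_t freq_hz,
                            uint32_t ticks_per_sec)
{
  assert(freq_hz != 0);

  return cycles * ticks_per_sec / freq_hz;
}
```

   For one hour of cycles at the assumed frequency, `ticks_truncated` yields 
3600879 ticks while `ticks_exact` yields 3600000, which is roughly 0.88 s of 
drift per hour.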
   
   Given how error-prone time conversion can be, why not provide a unified, 
reusable, and correct implementation for time conversion?
   
   The second major issue is performance.
   
   **Unnecessary reads of the current time**:
   One contributing factor is the overhead of reading the current time when it 
is not needed. When setting a timer through a relative timer interface, even on 
hardware that natively supports absolute timing, the system spends extra CPU 
cycles (around 60 cycles on x86) to read the current time.
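
   The extra read can be pictured with a toy model. Everything here is 
hypothetical: `struct fake_timer` stands in for a timer with an absolute 
compare register, and `start_relative` / `start_absolute` are illustrative 
names, not the NuttX driver interface.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a timer with a free-running counter and a compare register. */
struct fake_timer
{
  uint64_t counter;   /* current hardware count */
  uint64_t compare;   /* absolute expiry programmed into the hardware */
  unsigned reads;     /* how many times the counter was read */
};

static uint64_t fake_read_counter(struct fake_timer *t)
{
  t->reads++;
  return t->counter;
}

/* Relative interface: must read the counter to build an absolute expiry. */
static void start_relative(struct fake_timer *t, uint64_t delta)
{
  assert(t != 0);
  t->compare = fake_read_counter(t) + delta;
}

/* Absolute interface: the caller already knows the expiry, so no read. */
static void start_absolute(struct fake_timer *t, uint64_t expiry)
{
  assert(t != 0);
  t->compare = expiry;
}
```

   In this model both calls program the same compare value, but only the 
relative path pays for a counter read.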
   
   **Division Latency**:
   Another source of overhead is division operations.
   Calculating the current tick or timespec value requires division. Because 
division is complex to implement in hardware, it has significantly higher 
latency than other arithmetic operations. This not only costs CPU cycles but 
also introduces timing jitter.
   
   Moreover, the driver is also responsible for managing callback functions and 
handling multi-core concurrency. Improper handling can easily lead to 
thread-safety issues.
   
   ## ClockDevice
   
   To address these issues, we propose `ClockDevice`, a better timer driver 
abstraction designed to replace the original `oneshot` API. It aims to achieve 
the following design goals:
   - Functional correctness (no overflow in time conversion)
   - Optimized performance
   - Simplified driver implementation
   - Expressive interfaces
   
   `ClockDevice` implements two methods to accelerate time conversion:
   
   **Invariant-divisor Division (INVDIV)**: Used to convert clock counts to 
seconds or ticks.
   For a fixed divisor $d$ and every $n$ in the counter's range, we can find 
$m$ and $s$ such that 
$`\lfloor \frac {n \times m}{2^{s}} \rfloor = \lfloor \frac{n}{d} \rfloor`$.
   **Benefit**: Replaces the division with 1x unsigned multiply-high 
(`UMULH`), 1x subtraction, 1x addition and 1x logical right shift (`LShR`). 
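
   As a rough illustration of the idea, the sketch below divides 32-bit values 
by a fixed divisor using the well-known Granlund-Montgomery construction. It is 
a scaled-down stand-in, not the code from this PR: the PR works on hardware 
counts with `UMULH`, whereas this version uses a 64-bit product to obtain the 
high half, and the names `invdiv32_init` and `invdiv32_div` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Precomputed constants for dividing 32-bit values by a fixed divisor d.
 * The effective multiplier is m = 2^32 + mult and the shift is s = 32 + l,
 * where l = ceil(log2(d)).
 */
struct invdiv32
{
  uint32_t mult;
  uint8_t  sh1;   /* min(l, 1) */
  uint8_t  sh2;   /* max(l - 1, 0) */
};

static struct invdiv32 invdiv32_init(uint32_t d)
{
  struct invdiv32 inv;
  unsigned l = 0;

  assert(d != 0);

  while (((uint64_t)1 << l) < d)   /* l = ceil(log2(d)) */
    {
      l++;
    }

  /* mult = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits. */
  inv.mult = (uint32_t)((((((uint64_t)1 << l) - d) << 32) / d) + 1);
  inv.sh1  = (uint8_t)(l > 1 ? 1 : l);
  inv.sh2  = (uint8_t)(l > 0 ? l - 1 : 0);
  return inv;
}

/* One multiply-high, one subtraction, one addition, then shifts. */
static uint32_t invdiv32_div(uint32_t n, const struct invdiv32 *inv)
{
  uint32_t t = (uint32_t)(((uint64_t)inv->mult * n) >> 32);   /* high half */

  return (t + ((n - t) >> inv->sh1)) >> inv->sh2;
}
```

   The constants are computed once per divisor (for example, once per timer 
frequency), so the per-call cost is only the multiply, subtract, add and shift 
sequence.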
   
   
   **Multiply-Shift Approximate Division**: Used to convert delta counts into 
nanoseconds or ticks.
   This method, adopted in the Linux kernel for time conversion, trades a 
small amount of precision (a few nanoseconds) for better performance. However, 
because the multiplication can overflow for large inputs, it is only suitable 
for relative time conversions.
   INVDIV gives the exact quotient but costs 6-9 CPU cycles.
   This method needs only 1x unsigned multiplication and 1x `LShR`, which 
usually costs 4 CPU cycles.
   **Benefit**: Trades a small amount of time precision for lower latency.  
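
   A minimal sketch of the multiply-shift conversion, in the spirit of the 
Linux `mult`/`shift` scheme. The 24 MHz frequency and the shift of 24 are 
assumptions picked for the example, and `cyc2ns_init` / `cyc2ns_delta` are 
hypothetical names rather than functions from this PR.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Precomputed scale factor: ns ~= (cycles * mult) >> shift. */
struct cyc2ns
{
  uint32_t mult;
  uint32_t shift;
};

static struct cyc2ns cyc2ns_init(uint32_t freq_hz, uint32_t shift)
{
  struct cyc2ns c;
  uint64_t mult;

  assert(freq_hz != 0 && shift < 32);

  mult = (NSEC_PER_SEC << shift) / freq_hz;   /* rounded down */
  assert(mult <= UINT32_MAX);

  c.mult  = (uint32_t)mult;
  c.shift = shift;
  return c;
}

/* Converts a cycle delta to nanoseconds with one multiply and one shift.
 * The 64-bit product overflows for large deltas, so this is only meant
 * for short relative intervals.
 */
static uint64_t cyc2ns_delta(uint64_t delta_cycles, const struct cyc2ns *c)
{
  return (delta_cycles * c->mult) >> c->shift;
}
```

   With the assumed 24 MHz counter and a shift of 24, `mult` is 699050666 and 
a one-second delta converts to 999999999 ns, one nanosecond short of the exact 
value. The product overflows 64 bits once the delta exceeds roughly 2.6e10 
cycles (about 18 minutes at this frequency), which is why the method is limited 
to relative conversions.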
   
   ## Impact
   
   These commits affect the timing subsystem, as well as the following 
architectures:
   - arm-v7a/arm-v7r/arm-v8r
   - arm-v8a
   - riscv
   - sim
   - tricore
   - intel64
   
   ## Testing
   
   To evaluate the performance improvements, we conducted tests on three 
platforms: qemu-intel64/KVM, imx8qm-mek/arm64, and qemu-armv7a. We measured the 
CPU cycle overhead for:
   
   - Reading the current time
   - Setting a timer
   - Handling a timer callback
   
   Each operation was executed 10 million times, and the results were averaged.
   
   The results demonstrated significant performance improvements.
   
   On `qemu-armv7a`, software division is the performance bottleneck, and 
ClockDevice yields the following speed-ups:
   | API | Speed-up |
   | --- | --- |
   | clock_gettime | 2.56x |
   | clock_systime_ticks | 3.00x |
   | wd_start | 1.67x |
   | wd_start_cancel | 1.43x |
   | wd_expiration | 2.36x |
   <img width="579" height="435" alt="image" 
src="https://github.com/user-attachments/assets/dc7394c2-0cc8-4c99-81a8-2920f14fbc2b"
 />
   
   On `qemu-intel64/KVM`, ClockDevice achieved up to a **1.42x** performance 
improvement. In particular, for the `clock_gettime` API, NuttX with ClockDevice 
was **1.31x** faster than Linux Kernel 6.8.0-51.
   <img width="597" height="448" alt="image" 
src="https://github.com/user-attachments/assets/5c3a2dba-5823-41f3-9e73-17f0327de25c"
 />
   
   On `imx8qm-mek/arm64`, ClockDevice achieved up to a **1.68x** performance 
improvement. Note that on early ARM64 cores (Cortex-A53), the `INVDIV` 
optimization does not help, since the hardware division instruction `UDIV` 
costs fewer CPU cycles on average than `INVDIV`.
   <img width="607" height="455" alt="image" 
src="https://github.com/user-attachments/assets/576edd31-8a26-4a1c-8a5f-c86b8402157a"
 />
   
   ## Plan
   
   Due to the need for extensive code modifications, some work remains 
unfinished. The following is a list of the planned tasks.
   
   - [x] 1. Simplify the timer drivers and add SMP initialization.
   - [x] 2. Remove the callback and arg from the oneshot API.
   - [x] 3. Remove tick-based oneshot API.
   - [x] 4. Add new count-based oneshot API (ClockDevice).
   - [x] 5. Introduce optimized fast-path for count-based oneshot API.
   - [x] 6. Reimplement the commonly used timer drivers with the count-based 
oneshot API.
   - [x] 7. Add documentation for ClockDevice.
   - [WIP] 8. Inline arch_alarm for performance (3% to 5% less execution time 
for clock_gettime, wd_start and wd_expiration).
   
   Architecture support:
   - [x] arm-v7a/v7r/v8r generic timer
   - [x] goldfish
   - [x] arm-v8a generic timer
   - [x] sim
   - [x] intel64 TSC-deadline timer
   - [x] tricore systimer
   - [x] risc-v mtime
   - [WIP] risc-v bl602 timer
   - [WIP] risc-v esp32c3 timer  

