Fix-Point opened a new pull request, #17312: URL: https://github.com/apache/nuttx/pull/17312
## Summary

Part I of the PR https://github.com/apache/nuttx/pull/17276. This part of the commits simplifies the current timer driver for ClockDevice.

This PR proposes ClockDevice, a new timer driver abstraction for NuttX. The new CLOCKDEVICE timer hardware abstraction delivers:

- Functional correctness: Thread-safe and overflow-free.
- High performance: Up to 3x faster on ARMv7A platforms.
- Theoretically optimal timing precision: Uses hardware cycle counts as the time unit.
- Time-unit independent interfaces: Decouples timer drivers from OS time subsystems.
- Minimalist driver implementation: Reduces driver code size by nearly 70%.

## Situation

Let’s look at the current state of the timing subsystem in NuttX.

<img width="556" height="276" alt="image" src="https://github.com/user-attachments/assets/214b74e1-6fe7-4c64-8b58-3dab0043557a" />

The NuttX timing subsystem consists of four layers:

- **Hardware Timer Drivers**: Implementations of the various hardware timer drivers.
- **Timer Driver Abstraction**: Abstractions such as `Oneshot` and `Timer`, which provide the timer hardware abstraction.
- **OS Timer Interfaces**: `Timer(up_timer_*)` and `Alarm(up_alarm_*)`, offering relative and absolute timer interfaces.
- **OS Timer Abstraction**: The `wdog` module, which manages software timers and provides a unified timer API to upper layers.

Here we focus on the oneshot timer driver abstraction. So what are the current problems?

First, there are problems related to time conversion, primarily overflow and loss of precision.

**Overflow**: NuttX performs conversions between timespec and tick values, which can lead to overflow. For example, converting a timespec into ticks could cause a multiplication to overflow, resulting in incorrect values.

**Loss of Precision**: The current ARM Generic Timer uses `cycle_per_tick` to simplify time calculations. However, this approach is only accurate if the frequency is divisible by `USEC_PER_TICK`. If not, significant precision loss can occur.
For instance, on `Cortex-R82`, the timer frequency might be `2.73MHz`, a value that often doesn’t divide cleanly, leading to timing inaccuracies.

Given how error-prone time conversion can be, why not provide a unified, reusable, and correct implementation for time conversion?

The second major issue is performance.

**Unnecessary current time acquisition**: One contributing factor is the overhead of reading the current time when it is not needed. When setting a timer using a relative timer interface, even on hardware that natively supports absolute timing, the system incurs extra CPU cycles (around 60 cycles on x86) to read the current time.

**Division Latency**: Another source of overhead is division operations. Calculating the current tick or timespec value requires division. Due to the complex implementation of division in hardware, it has significantly higher latency than other operations. This not only slows down performance but also introduces timing jitter.

Moreover, the driver is also responsible for managing callback functions and handling multi-core concurrency. Improper handling can easily lead to thread-safety issues.

## ClockDevice

To address these issues, we propose `ClockDevice`, a better timer driver abstraction designed to replace the original `oneshot` API. It aims to achieve the following design goals:

- Functional correctness (no overflow in time conversion)
- Optimized performance
- Simplified driver implementation
- Expressive interfaces

`ClockDevice` implements two methods to accelerate time conversion:

**Invariant-divisor Division (INVDIV)**: Used to convert clock counts to seconds or ticks. For all $n$ and a given divisor $d$, we can find $m$ and $s$ such that $`\lfloor \frac {n \times m}{2^{s}} \rfloor = \lfloor \frac{n}{d} \rfloor`$.

**Benefit**: Transforms the division into 1x unsigned multiplication high (`UMULH`), 1x subtraction, 1x addition and 1x logical right shift (`LShR`).
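
The invariant-divisor idea can be sketched as follows. This is a minimal illustration, assuming 32-bit operands so that the multiply-high fits in plain `uint64_t` arithmetic; a 64-bit count would use a 128-bit multiply-high such as `UMULH` instead. The names `invdiv32`, `invdiv32_init`, `invdiv32_div` and `invdiv32_selfcheck` are hypothetical and are not taken from the PR. The constants follow the well-known round-up multiplier construction for unsigned division by an invariant divisor, which may differ from what ClockDevice actually implements. In terms of the formula above, the sketch corresponds to a multiplier of $2^{32} + m$ and a shift of $32 + l$, where $l = \lceil \log_2 d \rceil$.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical precomputed state for dividing by a fixed divisor d >= 1. */

struct invdiv32
{
  uint32_t m;    /* Low 32 bits of the multiplier (2^32 + m) */
  uint8_t  sh1;  /* min(l, 1), where l = ceil(log2(d)) */
  uint8_t  sh2;  /* max(l - 1, 0) */
};

/* One-time setup; this is the only place a real division happens. */

static struct invdiv32 invdiv32_init(uint32_t d)
{
  struct invdiv32 inv;
  uint32_t l = 0;

  assert(d != 0);

  /* l = ceil(log2(d)): the smallest l with 2^l >= d. */

  while (l < 32 && ((uint64_t)1 << l) < d)
    {
      l++;
    }

  /* m = floor(2^32 * (2^l - d) / d) + 1, which always fits in 32 bits. */

  inv.m   = (uint32_t)(((((uint64_t)1 << l) - d) << 32) / d) + 1;
  inv.sh1 = l > 0 ? 1 : 0;
  inv.sh2 = l > 0 ? (uint8_t)(l - 1) : 0;
  return inv;
}

/* Fast path: multiply-high, subtract, add, shift. Returns floor(n / d). */

static inline uint32_t invdiv32_div(uint32_t n, const struct invdiv32 *inv)
{
  uint32_t t = (uint32_t)(((uint64_t)inv->m * n) >> 32); /* multiply-high */

  return (t + ((n - t) >> inv->sh1)) >> inv->sh2;
}

/* Compare the fast path against native division for one divisor. */

static void invdiv32_selfcheck(uint32_t d)
{
  static const uint32_t samples[] =
  {
    0, 1, 2729, 2730, 2731, 1000000, 0x7fffffff, 0xfffffffe, 0xffffffff
  };

  struct invdiv32 inv = invdiv32_init(d);
  unsigned int i;

  for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
    {
      assert(invdiv32_div(samples[i], &inv) == samples[i] / d);
    }
}
```

As a usage example, with a hypothetical 2.73 MHz counter and an assumed 1 ms tick, the divisor is 2730 cycles per tick: `invdiv32_init(2730)` runs once at setup, and `invdiv32_div(count, &inv)` then yields the tick number with no divide instruction on the hot path.
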

**Multiply-Shift Approximate Division**: Used to convert delta counts into nanoseconds or ticks. This method, adopted in the Linux kernel for time conversion, trades off slight precision (a few nanoseconds) for better performance. However, due to potential multiplication overflow, it is only suitable for relative time conversions. The previous optimization (INVDIV) guarantees exact division but costs 6-9 CPU cycles. This method needs only 1x unsigned multiplication and 1x `LShR`, which usually costs 4 CPU cycles.

**Benefit**: It trades time precision for better performance.

## Impact

These commits affect the timing subsystem, as well as the following architectures:

- arm-v7a/arm-v7r/arm-v8r
- arm-v8a
- riscv
- sim
- tricore
- intel64

## Testing

To evaluate the performance improvements, we conducted tests on three platforms: qemu-intel64/KVM, imx8qm-mek/arm64, and qemu-armv7a.

We measured the CPU cycle overhead for:

- Reading the current time
- Setting a timer
- Handling a timer callback

Each operation was executed 10 million times, and the results were averaged. The results demonstrated significant performance improvements.

On `qemu-armv7a`, software division is the performance bottleneck. ClockDevice brings significant performance improvements:

| API | Speed-up |
| - | - |
| clock_gettime | 2.56x |
| clock_systime_ticks | 3.00x |
| wd_start | 1.67x |
| wd_start_cancel | 1.43x |
| wd_expiration | 2.36x |

<img width="579" height="435" alt="image" src="https://github.com/user-attachments/assets/dc7394c2-0cc8-4c99-81a8-2920f14fbc2b" />

On `qemu-intel64/KVM`, ClockDevice achieved up to a **1.42x** performance improvement. In particular, for the `clock_gettime` API, NuttX with the ClockDevice improvements performed **1.31x** better than Linux kernel 6.8.0-51.

<img width="597" height="448" alt="image" src="https://github.com/user-attachments/assets/5c3a2dba-5823-41f3-9e73-17f0327de25c" />

On `imx8qm-mek/arm64`, ClockDevice achieved up to a **1.68x** performance improvement.
Note that on early ARM64 platforms (Cortex-A53), the `INVDIV` optimization does not help much, since the hardware division instruction `UDIV` costs fewer CPU cycles on average than `INVDIV`.

<img width="607" height="455" alt="image" src="https://github.com/user-attachments/assets/576edd31-8a26-4a1c-8a5f-c86b8402157a" />

## Plan

Due to the need for extensive code modifications, some work remains unfinished. The following is a list of the planned tasks.

- [x] 1. Simplify the timer drivers and add SMP initialization.
- [x] 2. Remove the callback and arg from the oneshot API.
- [x] 3. Remove the tick-based oneshot API.
- [x] 4. Add the new count-based oneshot API (ClockDevice).
- [x] 5. Introduce an optimized fast path for the count-based oneshot API.
- [x] 6. Reimplement the commonly used timer drivers with the count-based oneshot API.
- [x] 7. Add documentation for ClockDevice.
- [WIP] 8. Inline arch_alarm for performance (3% ~ 5% less execution time for clock_gettime, wd_start and wd_expiration).

Architecture support:

- [x] arm-v7a/v7r/v8r generic timer
- [x] goldfish
- [x] arm-v8a generic timer
- [x] sim
- [x] intel64 TSC-deadline timer
- [x] tricore systimer
- [x] risc-v mtime
- [WIP] risc-v bl602 timer
- [WIP] risc-v esp32c3 timer
