From: Lizhi Hou <[email protected]> [ Upstream commit cd77d5a4aaf8c5c1d819f47cf814bf7d4920b0a2 ]
In mailbox_get_msg(), mailbox_reg_read_non_zero() is called to poll for a non-zero tail pointer. This assumed that a zero value indicates an error. However, certain corner cases legitimately produce a zero tail pointer. To handle these cases, remove mailbox_reg_read_non_zero(). The zero tail pointer will be treated as a valid rewind event. Reviewed-by: Maciej Falkowski <[email protected]> Signed-off-by: Lizhi Hou <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Sasha Levin <[email protected]> --- LLM Generated explanations, may be completely bogus: Good - `mailbox_get_tailptr()` already exists in v6.14 and the fix uses it. No dependency issues. ## Comprehensive Analysis ### 1. Commit Message Analysis The commit title explicitly says "Fix" and the message clearly describes the bug: `mailbox_reg_read_non_zero()` polls for a non-zero tail pointer, **incorrectly assuming that a zero value indicates an error**. However, zero is a legitimate tail pointer value in certain corner cases (ring buffer rewind events). The fix removes the flawed polling function and instead uses a simple direct read via the existing `mailbox_get_tailptr()`. ### 2. Code Change Analysis - The Bug Mechanism The bug is in `mailbox_reg_read_non_zero()`: ```115:129:/home/sasha/linux- autosel/drivers/accel/amdxdna/amdxdna_mailbox.c static int mailbox_reg_read_non_zero(struct mailbox_channel *mb_chann, u32 mbox_reg, u32 *val) { struct xdna_mailbox_res *mb_res = &mb_chann->mb->res; void __iomem *ringbuf_addr = mb_res->mbox_base + mbox_reg; int ret, value; /* Poll till value is not zero */ ret = readx_poll_timeout(readl, ringbuf_addr, value, value, 1 /* us */, 100); if (ret < 0) return ret; *val = value; return 0; } ``` This function uses `readx_poll_timeout()` with `value` as both the variable AND the condition (the condition is just `value`, i.e., truthy/non-zero). It polls every 1us for up to 100us. If the tail pointer register is zero after 100us, it returns `-ETIMEDOUT`. The call site in `mailbox_get_msg()`: ```289:290:/home/sasha/linux- autosel/drivers/accel/amdxdna/amdxdna_mailbox.c if (mailbox_reg_read_non_zero(mb_chann, mb_chann->res[CHAN_RES_I2X].mb_tail_ptr_reg, &tail)) return -EINVAL; ``` When the function times out (tail is legitimately zero), `mailbox_get_msg()` returns `-EINVAL`. **The catastrophic consequence** is in `mailbox_rx_worker()` (in v6.14): ```377:386:/home/sasha/linux- autosel/drivers/accel/amdxdna/amdxdna_mailbox.c /* Other error means device doesn't look good, disable irq. */ if (unlikely(ret)) { MB_ERR(mb_chann, "Unexpected ret %d, disable irq", ret); WRITE_ONCE(mb_chann->bad_state, true); disable_irq(mb_chann->msix_irq); break; } ``` Any return other than `0` or `-ENOENT` causes: 1. `bad_state = true` is set **permanently** (never reset) 2. `disable_irq()` is called (in v6.14) 3. All future work items check `bad_state` and immediately return 4. The mailbox channel is **permanently dead** - no more firmware communication is possible 5. The AMD AI Engine accelerator becomes non-functional **When does tail=0 legitimately occur?** The ring buffer protocol uses a TOMBSTONE marker when wrapping. After the writer (firmware, for the I2X channel) writes a TOMBSTONE at the end of the buffer and wraps to the start, the tail pointer can be zero briefly or permanently depending on the firmware's message scheduling. The commit describes this as a "valid rewind event." ### 3. Fix Correctness The fix replaces the flawed polling function with a direct register read: ```c tail = mailbox_get_tailptr(mb_chann, CHAN_RES_I2X); ``` This is obviously correct because the existing validation in `mailbox_get_msg()` properly handles a zero tail: - `tail > ringbuf_size || !IS_ALIGNED(tail, 4)` → 0 passes (aligned and within bounds) - `head == tail` → if head is also 0, returns `-ENOENT` (empty buffer, which is the correct behavior) - If head != 0 and tail == 0, the function processes messages normally (rewind scenario) ### 4. Scope and Risk Assessment - **Size**: 1 file changed, 1 insertion, 18 deletions (net -17 lines) - **Scope**: Entirely self-contained within the mailbox driver - **Risk**: Very low - the fix removes a polling function and replaces it with a simple direct read, relying on already-existing validation code - **Regression potential**: Near zero - all the defensive checks (bounds validation, alignment, empty-buffer check) remain intact ### 5. Dependencies - No dependencies on other commits - `mailbox_get_tailptr()` already exists in v6.14 (line 152) - The patch applies cleanly to v6.14.y stable (surrounding code is identical) ### 6. User Impact - Affects users of AMD AI Engine (XDNA) accelerators (Ryzen AI NPUs) - The bug causes the accelerator to become permanently non-functional when the tail pointer is legitimately zero - This is a **device-killing bug** that requires driver reload/reboot to recover - The amdxdna driver exists in stable trees starting from v6.14 ### 7. Stability Indicators - Has `Reviewed-by: Maciej Falkowski <[email protected]>` - Author is `Lizhi Hou <[email protected]>` - an AMD engineer and the primary amdxdna driver developer - The fix was accepted via the drm subsystem merge path ### Conclusion This commit fixes a real, severe bug where a false positive error detection (treating zero tail pointer as invalid) permanently kills the mailbox communication channel in the AMD XDNA accelerator driver. The consequence is the accelerator device becoming completely non-functional until driver reload/reboot. The fix is small (net -17 lines), obviously correct (removes incorrect assumption, relies on existing validation), self-contained (single file, no dependencies), and applies cleanly to the v6.14 stable tree. It meets all stable kernel criteria. **YES** drivers/accel/amdxdna/amdxdna_mailbox.c | 19 +------------------ 1 file changed, 1 insertion(+), 18 deletions(-) diff --git a/drivers/accel/amdxdna/amdxdna_mailbox.c b/drivers/accel/amdxdna/amdxdna_mailbox.c index 858df97cd3fbd..a60a85ce564cb 100644 --- a/drivers/accel/amdxdna/amdxdna_mailbox.c +++ b/drivers/accel/amdxdna/amdxdna_mailbox.c @@ -112,22 +112,6 @@ static u32 mailbox_reg_read(struct mailbox_channel *mb_chann, u32 mbox_reg) return readl(ringbuf_addr); } -static int mailbox_reg_read_non_zero(struct mailbox_channel *mb_chann, u32 mbox_reg, u32 *val) -{ - struct xdna_mailbox_res *mb_res = &mb_chann->mb->res; - void __iomem *ringbuf_addr = mb_res->mbox_base + mbox_reg; - int ret, value; - - /* Poll till value is not zero */ - ret = readx_poll_timeout(readl, ringbuf_addr, value, - value, 1 /* us */, 100); - if (ret < 0) - return ret; - - *val = value; - return 0; -} - static inline void mailbox_set_headptr(struct mailbox_channel *mb_chann, u32 headptr_val) { @@ -286,8 +270,7 @@ static int mailbox_get_msg(struct mailbox_channel *mb_chann) u32 start_addr; int ret; - if (mailbox_reg_read_non_zero(mb_chann, mb_chann->res[CHAN_RES_I2X].mb_tail_ptr_reg, &tail)) - return -EINVAL; + tail = mailbox_get_tailptr(mb_chann, CHAN_RES_I2X); head = mb_chann->i2x_head; ringbuf_size = mailbox_get_ringbuf_size(mb_chann, CHAN_RES_I2X); start_addr = mb_chann->res[CHAN_RES_I2X].rb_start_addr; -- 2.51.0
