On 2021-12-09 13:17, Robin Murphy wrote:
Sorry I missed this before...
On 2019-07-27 10:21, Xiongfeng Wang wrote:
Fix the following crash, which occurs when 'fq_flush_timeout()' accesses
'fq->lock' after 'iovad->fq' has already been cleared. This happens when
we call 'free_iova_flush_queue()' while the 'fq_timer' handler is still
executing: once the handler has started, the timer's pending state is
cleared and it is detached, so 'del_timer()' alone cannot stop it. This
patch uses 'del_timer_sync()' to wait for the timer handler
'fq_flush_timeout()' to finish before destroying the flush queue.
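For reference, the change described above amounts to something like the
sketch below. Function and field names follow the report and my reading of
drivers/iommu/iova.c; this is illustrative, not the literal patch:

```c
static void free_iova_flush_queue(struct iova_domain *iovad)
{
	if (!iovad->fq)
		return;

	/*
	 * del_timer() would only remove a still-pending timer; once the
	 * handler has started it is detached and may be running on
	 * another CPU.  del_timer_sync() additionally waits for a running
	 * fq_flush_timeout() to finish, so nothing can touch fq->lock
	 * past this point.
	 */
	del_timer_sync(&iovad->fq_timer);

	fq_destroy_all_entries(iovad);
	free_percpu(iovad->fq);
	iovad->fq = NULL;
}
```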
So if I understand correctly, you shut down the device - which naturally
frees some DMA mappings into the FQ - then hotplug it out, such that
tearing down its group and default domain can end up racing with the
timeout firing on a different CPU? It would help if the commit message
actually explained that - I've just reverse-engineered it from the given
symptom - rather than focusing on details that aren't really important.
fq->lock is hardly significant, since *any* access to the FQ while it's
being destroyed is fundamentally unsound. I also spent way too long
trying to understand the significance of the full stack trace below
before realising that it is in fact just irrelevant - there's only one
way fq_flush_timeout() ever gets called, and it's the obvious one.
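For anyone following along, that one path is the queueing side arming the
timer. Roughly (names from drivers/iommu/iova.c, quoted from memory, so
treat as a sketch):

```c
/* At init time, fq_flush_timeout() is registered as the handler: */
timer_setup(&iovad->fq_timer, fq_flush_timeout, 0);

/* ...and the only place it is ever armed, in queue_iova(): the first
 * queued entry kicks off a timeout to flush any stragglers. */
if (!atomic_read(&iovad->fq_timer_on) &&
    !atomic_xchg(&iovad->fq_timer_on, 1))
	mod_timer(&iovad->fq_timer,
		  jiffies + msecs_to_jiffies(IOVA_FQ_TIMEOUT));
```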
The fix itself seems reasonable - the kerneldoc for del_timer_sync() is
slightly scary, but since free_iova_flush_queue() doesn't touch any of
the locks and definitely shouldn't run in IRQ context I believe we're OK.
This will affect my IOVA refactoring series a little, so I'm happy to
help improve the writeup if you like - provided that my understanding is
actually correct - and include it in a v2 of that.
FWIW, this is what I came up with:
https://gitlab.arm.com/linux-arm/linux-rm/-/commit/ecea6835baca75b945bd8ecfaa636ff01dabcc1d
Let me know what you think.
Thanks,
Robin.
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu