Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
The first application to open the autodump node gets the right to use it. All others only get -EBUSY until the first application is done with the hardware. Christian. Am 15.05.20 um 04:40 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Dennis, This node/feature is for UMR extension. It is designed for a single user. Jiange *From:* Li, Dennis *Sent:* Thursday, May 14, 2020 11:15 PM *To:* Koenig, Christian ; Zhao, Jiange ; amd-gfx@lists.freedesktop.org *Cc:* Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Kuehling, Felix ; Liu, Monk ; Zhang, Hawking *Subject:* RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 [AMD Official Use Only - Internal Distribution Only] Hi, Jiange, How to handle the case that multi-apps do the auto dump? This patch seems not multi-process safety. Best Regards Dennis Li *From:*amd-gfx *On Behalf Of *Christian König *Sent:* Thursday, May 14, 2020 4:29 PM *To:* Zhao, Jiange ; amd-gfx@lists.freedesktop.org *Cc:* Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Kuehling, Felix ; Liu, Monk ; Zhang, Hawking *Subject:* Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Hi Jiange, it probably won't hurt, but I would just drop that. You need roughly 4 billion GPU resets until the UINT_MAX-1 becomes zero again. Christian. Am 14.05.20 um 09:14 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the state as in amdgpu_debugfs_autodump_init(). I want to make sure every open() deals with the same situation. Jiange *From:* Christian König <mailto:ckoenig.leichtzumer...@gmail.com> *Sent:* Thursday, May 14, 2020 3:01 PM *To:* Zhao, Jiange <mailto:jiange.z...@amd.com>; amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org> *Cc:* Pelloux-prayer, Pierre-eric <mailto:pierre-eric.pelloux-pra...@amd.com>; Zhao, Jiange <mailto:jiange.z...@amd.com>; Kuehling, Felix <mailto:felix.kuehl...@amd.com>; Deucher, Alexander <mailto:alexander.deuc...@amd.com>; Koenig, Christian <mailto:christian.koe...@amd.com>; Liu, Monk <mailto:monk@amd.com>; Zhang, Hawking <mailto:hawking.zh...@amd.com> *Subject:* Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 14.05.20 um 07:29 schrieb jia...@amd.com <mailto:jia...@amd.com>: > From: Jiange Zhao <mailto:jiange.z...@amd.com> > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, > rename debugfs file to amdgpu_autodump, > provide autodump_read as well, > style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: > (1)Only 1 user can open this file node > (2)wait_dump() can only take effect after poll() executed. > (3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > v7: move reinit_completion into open() so
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi Dennis, This node/feature is for UMR extension. It is designed for a single user. Jiange From: Li, Dennis Sent: Thursday, May 14, 2020 11:15 PM To: Koenig, Christian ; Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Kuehling, Felix ; Liu, Monk ; Zhang, Hawking Subject: RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 [AMD Official Use Only - Internal Distribution Only] Hi, Jiange, How to handle the case that multi-apps do the auto dump? This patch seems not multi-process safety. Best Regards Dennis Li From: amd-gfx On Behalf Of Christian König Sent: Thursday, May 14, 2020 4:29 PM To: Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Kuehling, Felix ; Liu, Monk ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Hi Jiange, it probably won't hurt, but I would just drop that. You need roughly 4 billion GPU resets until the UINT_MAX-1 becomes zero again. Christian. Am 14.05.20 um 09:14 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the state as in amdgpu_debugfs_autodump_init(). I want to make sure every open() deals with the same situation. Jiange From: Christian König <mailto:ckoenig.leichtzumer...@gmail.com> Sent: Thursday, May 14, 2020 3:01 PM To: Zhao, Jiange <mailto:jiange.z...@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org> Cc: Pelloux-prayer, Pierre-eric <mailto:pierre-eric.pelloux-pra...@amd.com>; Zhao, Jiange <mailto:jiange.z...@amd.com>; Kuehling, Felix <mailto:felix.kuehl...@amd.com>; Deucher, Alexander <mailto:alexander.deuc...@amd.com>; Koenig, Christian <mailto:christian.koe...@amd.com>; Liu, Monk <mailto:monk@amd.com>; Zhang, Hawking <mailto:hawking.zh...@amd.com> Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 14.05.20 um 07:29 schrieb jia...@amd.com<mailto:jia...@amd.com>: > From: Jiange Zhao <mailto:jiange.z...@amd.com> > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, >rename debugfs file to amdgpu_autodump, >provide autodump_read as well, >style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: >(1)Only 1 user can open this file node >(2)wait_dump() can only take effect after poll() executed. >(3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > v7: move reinit_completion into open() so that only one user > can open it. > > Signed-off-by: Jiange Zhao <mailto:jiange.z...@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 88 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 2a806cb55b78..9e8eeddfe7ce 100644 > --- a/driver
RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi, Jiange, How to handle the case that multi-apps do the auto dump? This patch seems not multi-process safety. Best Regards Dennis Li From: amd-gfx On Behalf Of Christian König Sent: Thursday, May 14, 2020 4:29 PM To: Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Kuehling, Felix ; Liu, Monk ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Hi Jiange, it probably won't hurt, but I would just drop that. You need roughly 4 billion GPU resets until the UINT_MAX-1 becomes zero again. Christian. Am 14.05.20 um 09:14 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the state as in amdgpu_debugfs_autodump_init(). I want to make sure every open() deals with the same situation. Jiange From: Christian König <mailto:ckoenig.leichtzumer...@gmail.com> Sent: Thursday, May 14, 2020 3:01 PM To: Zhao, Jiange <mailto:jiange.z...@amd.com>; amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org> Cc: Pelloux-prayer, Pierre-eric <mailto:pierre-eric.pelloux-pra...@amd.com>; Zhao, Jiange <mailto:jiange.z...@amd.com>; Kuehling, Felix <mailto:felix.kuehl...@amd.com>; Deucher, Alexander <mailto:alexander.deuc...@amd.com>; Koenig, Christian <mailto:christian.koe...@amd.com>; Liu, Monk <mailto:monk@amd.com>; Zhang, Hawking <mailto:hawking.zh...@amd.com> Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 14.05.20 um 07:29 schrieb jia...@amd.com<mailto:jia...@amd.com>: > From: Jiange Zhao <mailto:jiange.z...@amd.com> > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, >rename debugfs file to amdgpu_autodump, >provide autodump_read as well, >style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: >(1)Only 1 user can open this file node >(2)wait_dump() can only take effect after poll() executed. >(3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > v7: move reinit_completion into open() so that only one user > can open it. > > Signed-off-by: Jiange Zhao <mailto:jiange.z...@amd.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 88 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 2a806cb55b78..9e8eeddfe7ce 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -992,6 +992,8 @@ struct amdgpu_device { >charproduct_number[16]; >charproduct_name[32]; >charserial[16]; > + > + struct amdgpu_autodump autodump; > }; > > static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device > *bdev) > diff --git a/drivers/gpu
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 14.05.20 um 11:18 schrieb jia...@amd.com: From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. v7: move reinit_completion into open() so that only one user can open it. v8: remove complete_all() from amdgpu_debugfs_wait_dump(). Signed-off-by: Jiange Zhao Reviewed-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..d33cb344be69 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) { + reinit_completion(>autodump.dumping); + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release,
[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. v7: move reinit_completion into open() so that only one user can open it. v8: remove complete_all() from amdgpu_debugfs_wait_dump(). Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..d33cb344be69 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) { + reinit_completion(>autodump.dumping); + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) +{ +
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Hi Jiange, it probably won't hurt, but I would just drop that. You need roughly 4 billion GPU resets until the UINT_MAX-1 becomes zero again. Christian. Am 14.05.20 um 09:14 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the state as in amdgpu_debugfs_autodump_init(). I want to make sure every open() deals with the same situation. Jiange *From:* Christian König *Sent:* Thursday, May 14, 2020 3:01 PM *To:* Zhao, Jiange ; amd-gfx@lists.freedesktop.org *Cc:* Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; Kuehling, Felix ; Deucher, Alexander ; Koenig, Christian ; Liu, Monk ; Zhang, Hawking *Subject:* Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 14.05.20 um 07:29 schrieb jia...@amd.com: > From: Jiange Zhao > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, > rename debugfs file to amdgpu_autodump, > provide autodump_read as well, > style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: > (1)Only 1 user can open this file node > (2)wait_dump() can only take effect after poll() executed. > (3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > v7: move reinit_completion into open() so that only one user > can open it. > > Signed-off-by: Jiange Zhao > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 88 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 2a806cb55b78..9e8eeddfe7ce 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -992,6 +992,8 @@ struct amdgpu_device { > char product_number[16]; > char product_name[32]; > char serial[16]; > + > + struct amdgpu_autodump autodump; > }; > > static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 1a4894fa3693..efee3f1adecf 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -27,7 +27,7 @@ > #include > #include > #include > - > +#include > #include > > #include "amdgpu.h" > @@ -74,8 +74,83 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, > return 0; > } > > +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) > +{ > +#if defined(CONFIG_DEBUG_FS) > + unsigned long timeout = 600 * HZ; > + int ret; > + > + wake_up_interruptible(>autodump.gpu_hang); > + > + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); > + complete_all(>autodump.dumping); Sorry that I'm mentioning this only now. But what is this complete_all() here good for? I mean we already waited for completion, didn't we? Christian.
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi Christian, wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the state as in amdgpu_debugfs_autodump_init(). I want to make sure every open() deals with the same situation. Jiange From: Christian K?nig Sent: Thursday, May 14, 2020 3:01 PM To: Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; Kuehling, Felix ; Deucher, Alexander ; Koenig, Christian ; Liu, Monk ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 14.05.20 um 07:29 schrieb jia...@amd.com: > From: Jiange Zhao > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, >rename debugfs file to amdgpu_autodump, >provide autodump_read as well, >style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: >(1)Only 1 user can open this file node >(2)wait_dump() can only take effect after poll() executed. >(3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > v7: move reinit_completion into open() so that only one user > can open it. > > Signed-off-by: Jiange Zhao > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 88 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 2a806cb55b78..9e8eeddfe7ce 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -992,6 +992,8 @@ struct amdgpu_device { >charproduct_number[16]; >charproduct_name[32]; >charserial[16]; > + > + struct amdgpu_autodump autodump; > }; > > static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device > *bdev) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 1a4894fa3693..efee3f1adecf 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -27,7 +27,7 @@ > #include > #include > #include > - > +#include > #include > > #include "amdgpu.h" > @@ -74,8 +74,83 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, >return 0; > } > > +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) > +{ > +#if defined(CONFIG_DEBUG_FS) > + unsigned long timeout = 600 * HZ; > + int ret; > + > + wake_up_interruptible(>autodump.gpu_hang); > + > + ret = > wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); > + complete_all(>autodump.dumping); Sorry that I'm mentioning this only now. But what is this complete_all() here good for? I mean we already waited for completion, didn't we? Christian. > + if (ret == 0) { > + pr_err("autodump: timeout, move on to gpu recovery\n"); > + return -ETIMEDOUT; > + } > +#endif > +
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 14.05.20 um 07:29 schrieb jia...@amd.com: From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. v7: move reinit_completion into open() so that only one user can open it. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 88 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..efee3f1adecf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,83 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); Sorry that I'm mentioning this only now. But what is this complete_all() here good for? I mean we already waited for completion, didn't we? Christian. + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) { + reinit_completion(>autodump.dumping); + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open =
[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. v7: move reinit_completion into open() so that only one user can open it. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 88 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..efee3f1adecf 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,83 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) { + reinit_completion(>autodump.dumping); + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) +{ + init_completion(>autodump.dumping); +
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Since usermode app might open a file , do nothing and close it. That case is unproblematic since closing the debugfs file sets the state of the struct completion to completed again no matter if we waited or not. But when you don't reset in the open() callback we open a small windows between open and poll where userspace could open the debugfs file twice. Regards, Christian. Am 13.05.20 um 11:37 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, Since amdgpu_debugfs_wait_dump() would need 'audodump.dumping.done==0' to actually stop and wait for user mode app to dump. Since usermode app might open a file , do nothing and close it. I believe a poll() function would be a better indicator that the usermode app actually wants to do a dump. Also, a reset might happen between open() and poll(). The worst case would be wait_dump() would wait until timeout and usermode poll would always fail. Jiange -Original Message- From: Christian König Sent: Wednesday, May 13, 2020 4:20 PM To: Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; Koenig, Christian ; Liu, Monk Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 09.05.20 um 11:45 schrieb jia...@amd.com: From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..261b67ece7fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) { #if +defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inod
RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi Christian, Since amdgpu_debugfs_wait_dump() would need 'audodump.dumping.done==0' to actually stop and wait for user mode app to dump. Since usermode app might open a file , do nothing and close it. I believe a poll() function would be a better indicator that the usermode app actually wants to do a dump. Also, a reset might happen between open() and poll(). The worst case would be wait_dump() would wait until timeout and usermode poll would always fail. Jiange -Original Message- From: Christian König Sent: Wednesday, May 13, 2020 4:20 PM To: Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; Koenig, Christian ; Liu, Monk Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 09.05.20 um 11:45 schrieb jia...@amd.com: > From: Jiange Zhao > > When GPU got timeout, it would notify an interested part of an > opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system and > poll() for readable/writable. When a GPU reset is due, amdgpu would > notify usermode app through wait_queue_head and give it 10 minutes to > dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through the > completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, >rename debugfs file to amdgpu_autodump, >provide autodump_read as well, >style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: > (1)Only 1 user can open this file node > (2)wait_dump() can only take effect after poll() executed. > (3)eliminated the race condition between release() and > wait_dump() > > v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' > removed state checking in amdgpu_debugfs_wait_dump > Improve on top of version 3 so that the node can be reopened. > > Signed-off-by: Jiange Zhao > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 87 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 2a806cb55b78..9e8eeddfe7ce 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -992,6 +992,8 @@ struct amdgpu_device { > charproduct_number[16]; > charproduct_name[32]; > charserial[16]; > + > + struct amdgpu_autodump autodump; > }; > > static inline struct amdgpu_device *amdgpu_ttm_adev(struct > ttm_bo_device *bdev) diff --git > a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 1a4894fa3693..261b67ece7fb 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -27,7 +27,7 @@ > #include > #include > #include > - > +#include > #include > > #include "amdgpu.h" > @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, > return 0; > } > > +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) { #if > +defined(CONFIG_DEBUG_FS) > + unsigned long timeout = 600 * HZ; > + int ret; > + > + wake_up_interruptible(>autodump.gpu_hang); > + > + ret = > wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); > + complete_all(>autodump.dumping); > + if (ret == 0) { > + pr_err("autodump: timeout, move on to gpu recovery\n"); > + return -ETIMEDOUT; > + } > +#endif > + return 0; > +} > + > #if defined(CONFIG_DEBUG_FS) > > +static int amdgpu_debugfs_autodump_op
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Thanks for the reminder, had to much todo yesterday and just forgot about it. Christian. Am 13.05.20 um 10:16 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi @Koenig, Christian <mailto:christian.koe...@amd.com>, I made some changes on top of version 3 and tested it. Can you help review? Jiange *From:* Zhao, Jiange *Sent:* Saturday, May 9, 2020 5:45 PM *To:* amd-gfx@lists.freedesktop.org *Cc:* Koenig, Christian ; Pelloux-prayer, Pierre-eric ; Deucher, Alexander ; Liu, Monk ; Zhao, Jiange *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { char product_number[16]; char product_name[32]; char serial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..261b67ece7fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) + ret = 0; + else + ret = -EBUSY; + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = fil
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 09.05.20 um 11:45 schrieb jia...@amd.com: From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..261b67ece7fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) + ret = 0; + else + ret = -EBUSY; + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + reinit_completion(>autodump.dumping); Why do you have the reinit_completion here and not in open callback? Apart from that looks good to me. Regards, Christian. + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi @Koenig, Christian<mailto:christian.koe...@amd.com>, I made some changes on top of version 3 and tested it. Can you help review? Jiange From: Zhao, Jiange Sent: Saturday, May 9, 2020 5:45 PM To: amd-gfx@lists.freedesktop.org Cc: Koenig, Christian ; Pelloux-prayer, Pierre-eric ; Deucher, Alexander ; Liu, Monk ; Zhao, Jiange Subject: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..261b67ece7fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) + ret = 0; + else + ret = -EBUSY; + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + reinit_completion(>autodump.dumping); + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->i
[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex' removed state checking in amdgpu_debugfs_wait_dump Improve on top of version 3 so that the node can be reopened. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 78 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 6 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 87 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 2a806cb55b78..9e8eeddfe7ce 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -992,6 +992,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..261b67ece7fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,82 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + complete_all(>autodump.dumping); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (adev->autodump.dumping.done) + ret = 0; + else + ret = -EBUSY; + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete_all(>autodump.dumping); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + reinit_completion(>autodump.dumping); + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) +{ + init_completion(>autodump.dumping); + complete_all(>autodump.dumping); + init_waitqueue_head(>autodump.gpu_hang); + +
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 06.05.20 um 05:45 schrieb Zhao, Jiange: [AMD Official Use Only - Internal Distribution Only] Hi Christian, Hi Jiange, well that looks correct to me, but seems to be a bit to complicated. What exactly was wrong with version 3? (1) If you open amdgpu_autodump, use it and close it, then you can't open it again, because wait_for_completion_interruptible_timeout() would decrement amdgpu_autodump.dumping.done to 0, then .open() would always return failure except the first time. In this case we should probably just use complete_all() instead of just complete(). So that the struct complete stays in the completed state. (2) reset lock is not optimal in this case. Because usermode app would take any operation at any time and there are so many race conditions to solve. A dedicated lock would be simpler and clearer. I don't think that this will work. Using the reset lock is mandatory here or otherwise we always race between a new process opening the file and an ongoing GPU reset. Just imagine what happens when the process which waited for the GPU reset event doesn't do a dump, but just closes and immediately reopens the file while the last reset is still ongoing. What we could do here is using mutex_trylock() on the reset lock and return -EBUSY when a reset is ongoing. Or maybe better mutex_lock_interruptible(). Please completely drop this extra check. Waking up the queue and waiting for completion should always work when done right. This check is very necessary, because if there is no usermode app listening, the following wait_for_completion_interruptible_timeout() would wait until timeout anyway, which is 10 minutes for nothing. This is not what we wanted. See the wait_event_* documentation, exactly that's what you should never do. Instead just signal the struct completion with complete_all() directly after it is created. This way the wakeup is a no-op and waiting for the struct completion continues immediately. Regards, Christian. Jiange -Original Message- From: Koenig, Christian Sent: Wednesday, April 29, 2020 10:09 PM To: Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Kuehling, Felix ; Deucher, Alexander ; Liu, Monk ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 29.04.20 um 16:04 schrieb Pierre-Eric Pelloux-Prayer: Hi Jiange, This version seems to work fine. Tested-by: Pierre-Eric Pelloux-Prayer On 29/04/2020 07:08, Zhao, Jiange wrote: [AMD Official Use Only - Internal Distribution Only] Hi all, I worked out the race condition and here is version 5. Please have a look. Jiange - - - - - - - - - - - - - - *From:* Zhao, Jiange *Sent:* Wednesday, April 29, 2020 1:06 PM *To:* amd-gfx@lists.freedesktop.org *Cc:* Koenig, Christian ; Kuehling, Felix ; Pelloux-prayer, Pierre-eric ; Deucher, Alexander ; Zhang, Hawking ; Liu, Monk ; Zhao, Jiange *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups
RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi Christian, > Hi Jiange, well that looks correct to me, but seems to be a bit to > complicated. What exactly was wrong with version 3? (1) If you open amdgpu_autodump, use it and close it, then you can't open it again, because wait_for_completion_interruptible_timeout() would decrement amdgpu_autodump.dumping.done to 0, then .open() would always return failure except the first time. (2) reset lock is not optimal in this case. Because usermode app would take any operation at any time and there are so many race conditions to solve. A dedicated lock would be simpler and clearer. > Please completely drop this extra check. Waking up the queue and waiting for > completion should always work when done right. This check is very necessary, because if there is no usermode app listening, the following wait_for_completion_interruptible_timeout() would wait until timeout anyway, which is 10 minutes for nothing. This is not what we wanted. Jiange -Original Message- From: Koenig, Christian Sent: Wednesday, April 29, 2020 10:09 PM To: Pelloux-prayer, Pierre-eric ; Zhao, Jiange ; amd-gfx@lists.freedesktop.org Cc: Kuehling, Felix ; Deucher, Alexander ; Liu, Monk ; Zhang, Hawking Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 Am 29.04.20 um 16:04 schrieb Pierre-Eric Pelloux-Prayer: > Hi Jiange, > > This version seems to work fine. > > Tested-by: Pierre-Eric Pelloux-Prayer > > > > On 29/04/2020 07:08, Zhao, Jiange wrote: >> [AMD Official Use Only - Internal Distribution Only] >> >> >> Hi all, >> >> I worked out the race condition and here is version 5. Please have a look. >> >> Jiange >> - >> - >> - >> - >> - >> - >> - >> - >> - >> - >> - >> - >> - >> - >> >> *From:* Zhao, Jiange >> *Sent:* Wednesday, April 29, 2020 1:06 PM >> *To:* amd-gfx@lists.freedesktop.org >> *Cc:* Koenig, Christian ; Kuehling, Felix >> ; Pelloux-prayer, Pierre-eric >> ; Deucher, Alexander >> ; Zhang, Hawking ; >> Liu, Monk ; Zhao, Jiange >> *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu >> reset v4 >> >> From: Jiange Zhao >> >> When GPU got timeout, it would notify an interested part of an >> opportunity to dump info before actual GPU reset. >> >> A usermode app would open 'autodump' node under debugfs system and >> poll() for readable/writable. When a GPU reset is due, amdgpu would >> notify usermode app through wait_queue_head and give it 10 minutes to >> dump info. >> >> After usermode app has done its work, this 'autodump' node is closed. >> On node closure, amdgpu gets to know the dump is done through the >> completion that is triggered in release(). >> >> There is no write or read callback because necessary info can be >> obtained through dmesg and umr. Messages back and forth between >> usermode app and amdgpu are unnecessary. >> >> v2: (1) changed 'registered' to 'app_listening' >> (2) add a mutex in open() to prevent race condition >> >> v3 (chk): grab the reset lock to avoid race in autodump_open, >> rename debugfs file to amdgpu_autodump, >> provide autodump_read as well, >> style and code cleanups >> >> v4: add 'bool app_listening' to differentiate situations, so that >> the node can be reopened; also, there is no need to wait for >> completion when no app is waiting for a dump. >> >> v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' >> add 'app_state_mutex' for race conditions: >> (1)Only 1 user can open this file node
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 29.04.20 um 16:04 schrieb Pierre-Eric Pelloux-Prayer: Hi Jiange, This version seems to work fine. Tested-by: Pierre-Eric Pelloux-Prayer On 29/04/2020 07:08, Zhao, Jiange wrote: [AMD Official Use Only - Internal Distribution Only] Hi all, I worked out the race condition and here is version 5. Please have a look. Jiange -- *From:* Zhao, Jiange *Sent:* Wednesday, April 29, 2020 1:06 PM *To:* amd-gfx@lists.freedesktop.org *Cc:* Koenig, Christian ; Kuehling, Felix ; Pelloux-prayer, Pierre-eric ; Deucher, Alexander ; Zhang, Hawking ; Liu, Monk ; Zhao, Jiange *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() Hi Jiange, well that looks correct to me, but seems to be a bit to complicated. What exactly was wrong with version 3? One more comment below. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 92 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 14 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 109 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index bc1e0fd71a09..6f8ef98c4b97 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -990,6 +990,8 @@ struct amdgpu_device { char product_number[16]; char product_name[32]; char serial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..1d4a95e8ad5b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,96 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + mutex_lock(>autodump.app_state_mutex); + if (adev->autodump.app_state != AMDGPU_AUTODUMP_LISTENING) { + mutex_unlock(>autodump.app_state_mutex); +
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Hi Jiange, This version seems to work fine. Tested-by: Pierre-Eric Pelloux-Prayer On 29/04/2020 07:08, Zhao, Jiange wrote: > [AMD Official Use Only - Internal Distribution Only] > > > Hi all, > > I worked out the race condition and here is version 5. Please have a look. > > Jiange > -- > *From:* Zhao, Jiange > *Sent:* Wednesday, April 29, 2020 1:06 PM > *To:* amd-gfx@lists.freedesktop.org > *Cc:* Koenig, Christian ; Kuehling, Felix > ; Pelloux-prayer, Pierre-eric > ; Deucher, Alexander > ; Zhang, Hawking ; Liu, > Monk ; Zhao, Jiange > *Subject:* [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 > > From: Jiange Zhao > > When GPU got timeout, it would notify an interested part > of an opportunity to dump info before actual GPU reset. > > A usermode app would open 'autodump' node under debugfs system > and poll() for readable/writable. When a GPU reset is due, > amdgpu would notify usermode app through wait_queue_head and give > it 10 minutes to dump info. > > After usermode app has done its work, this 'autodump' node is closed. > On node closure, amdgpu gets to know the dump is done through > the completion that is triggered in release(). > > There is no write or read callback because necessary info can be > obtained through dmesg and umr. Messages back and forth between > usermode app and amdgpu are unnecessary. > > v2: (1) changed 'registered' to 'app_listening' > (2) add a mutex in open() to prevent race condition > > v3 (chk): grab the reset lock to avoid race in autodump_open, > rename debugfs file to amdgpu_autodump, > provide autodump_read as well, > style and code cleanups > > v4: add 'bool app_listening' to differentiate situations, so that > the node can be reopened; also, there is no need to wait for > completion when no app is waiting for a dump. > > v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' > add 'app_state_mutex' for race conditions: > (1)Only 1 user can open this file node > (2)wait_dump() can only take effect after poll() executed. > (3)eliminated the race condition between release() and > wait_dump() > > Signed-off-by: Jiange Zhao > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 92 - > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 14 > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + > 4 files changed, 109 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index bc1e0fd71a09..6f8ef98c4b97 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -990,6 +990,8 @@ struct amdgpu_device { > char product_number[16]; > char product_name[32]; > char serial[16]; > + > + struct amdgpu_autodump autodump; > }; > > static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device > *bdev) > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > index 1a4894fa3693..1d4a95e8ad5b 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c > @@ -27,7 +27,7 @@ > #include > #include > #include > - > +#include > #include > > #include "amdgpu.h" > @@ -74,8 +74,96 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, > return 0; > } > > +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) > +{ > +#if defined(CONFIG_DEBUG_FS
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
[AMD Official Use Only - Internal Distribution Only] Hi all, I worked out the race condition and here is version 5. Please have a look. Jiange From: Zhao, Jiange Sent: Wednesday, April 29, 2020 1:06 PM To: amd-gfx@lists.freedesktop.org Cc: Koenig, Christian ; Kuehling, Felix ; Pelloux-prayer, Pierre-eric ; Deucher, Alexander ; Zhang, Hawking ; Liu, Monk ; Zhao, Jiange Subject: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4 From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 92 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 14 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 109 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index bc1e0fd71a09..6f8ef98c4b97 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -990,6 +990,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..1d4a95e8ad5b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,96 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + mutex_lock(>autodump.app_state_mutex); + if (adev->autodump.app_state != AMDGPU_AUTODUMP_LISTENING) { + mutex_unlock(>autodump.app_state_mutex); + return 0; + } + mutex_unlock(>autodump.app_state_mutex); + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>autodump.app_state_mutex); + if (adev->autodump.app_state == AMDGPU_AUTODUMP_NO_APP) { + adev->autodump.app_state = AMDGPU_AUTODUMP_REGISTERED; + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>autodump.app_state_mutex); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + mutex_lock(>autodump.app_state_mutex); + complete(>autodump.dumping); + adev->autodump.app_state = AMDGPU_AUTODUMP_NO_APP; + mutex_unlock(>
[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. v5: change 'bool app_listening' to 'enum amdgpu_autodump_state' add 'app_state_mutex' for race conditions: (1)Only 1 user can open this file node (2)wait_dump() can only take effect after poll() executed. (3)eliminated the race condition between release() and wait_dump() Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 92 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 14 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 109 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index bc1e0fd71a09..6f8ef98c4b97 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -990,6 +990,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..1d4a95e8ad5b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,8 +74,96 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ +#if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + mutex_lock(>autodump.app_state_mutex); + if (adev->autodump.app_state != AMDGPU_AUTODUMP_LISTENING) { + mutex_unlock(>autodump.app_state_mutex); + return 0; + } + mutex_unlock(>autodump.app_state_mutex); + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + #if defined(CONFIG_DEBUG_FS) +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>autodump.app_state_mutex); + if (adev->autodump.app_state == AMDGPU_AUTODUMP_NO_APP) { + adev->autodump.app_state = AMDGPU_AUTODUMP_REGISTERED; + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>autodump.app_state_mutex); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + mutex_lock(>autodump.app_state_mutex); + complete(>autodump.dumping); + adev->autodump.app_state = AMDGPU_AUTODUMP_NO_APP; + mutex_unlock(>autodump.app_state_mutex); + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + mutex_lock(>autodump.app_state_mutex); + poll_wait(file, >autodump.gpu_hang, poll_table); + adev->autodump.app_state = AMDGPU_AUTODUMP_LISTENING; + mutex_unlock(>autodump.app_state_mutex); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct
Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Am 26.04.20 um 12:09 schrieb jia...@amd.com: From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. NAK, exactly that is racy and should be avoided. What problem are you seeing here? Regards, Christian. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 82 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 7 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 92 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index bc1e0fd71a09..6f8ef98c4b97 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -990,6 +990,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..04720264e8b9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,7 +74,85 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ #if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + if (!adev->autodump.app_listening) + return 0; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + +#if defined(CONFIG_DEBUG_FS) + +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (!adev->autodump.app_listening) { + adev->autodump.app_listening = true; + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete(>autodump.dumping); + adev->autodump.app_listening = false; + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) +{ + init_completion(>autodump.dumping); + init_waitqueue_head(>autodump.gpu_hang); + adev->autodump.app_listening = false; + + debugfs_create_file("amdgpu_autodump", 0600, + adev->ddev->primary->debugfs_root, + adev, _debug_fops); +} /** * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes @@ -1434,6
[PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
From: Jiange Zhao When GPU got timeout, it would notify an interested part of an opportunity to dump info before actual GPU reset. A usermode app would open 'autodump' node under debugfs system and poll() for readable/writable. When a GPU reset is due, amdgpu would notify usermode app through wait_queue_head and give it 10 minutes to dump info. After usermode app has done its work, this 'autodump' node is closed. On node closure, amdgpu gets to know the dump is done through the completion that is triggered in release(). There is no write or read callback because necessary info can be obtained through dmesg and umr. Messages back and forth between usermode app and amdgpu are unnecessary. v2: (1) changed 'registered' to 'app_listening' (2) add a mutex in open() to prevent race condition v3 (chk): grab the reset lock to avoid race in autodump_open, rename debugfs file to amdgpu_autodump, provide autodump_read as well, style and code cleanups v4: add 'bool app_listening' to differentiate situations, so that the node can be reopened; also, there is no need to wait for completion when no app is waiting for a dump. Signed-off-by: Jiange Zhao --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 82 - drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h | 7 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 + 4 files changed, 92 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index bc1e0fd71a09..6f8ef98c4b97 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -990,6 +990,8 @@ struct amdgpu_device { charproduct_number[16]; charproduct_name[32]; charserial[16]; + + struct amdgpu_autodump autodump; }; static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index 1a4894fa3693..04720264e8b9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c @@ -27,7 +27,7 @@ #include #include #include - +#include #include #include "amdgpu.h" @@ -74,7 +74,85 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev, return 0; } +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev) +{ #if defined(CONFIG_DEBUG_FS) + unsigned long timeout = 600 * HZ; + int ret; + + if (!adev->autodump.app_listening) + return 0; + + wake_up_interruptible(>autodump.gpu_hang); + + ret = wait_for_completion_interruptible_timeout(>autodump.dumping, timeout); + if (ret == 0) { + pr_err("autodump: timeout, move on to gpu recovery\n"); + return -ETIMEDOUT; + } +#endif + return 0; +} + +#if defined(CONFIG_DEBUG_FS) + +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = inode->i_private; + int ret; + + file->private_data = adev; + + mutex_lock(>lock_reset); + if (!adev->autodump.app_listening) { + adev->autodump.app_listening = true; + ret = 0; + } else { + ret = -EBUSY; + } + mutex_unlock(>lock_reset); + + return ret; +} + +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file) +{ + struct amdgpu_device *adev = file->private_data; + + complete(>autodump.dumping); + adev->autodump.app_listening = false; + return 0; +} + +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table) +{ + struct amdgpu_device *adev = file->private_data; + + poll_wait(file, >autodump.gpu_hang, poll_table); + + if (adev->in_gpu_reset) + return POLLIN | POLLRDNORM | POLLWRNORM; + + return 0; +} + +static const struct file_operations autodump_debug_fops = { + .owner = THIS_MODULE, + .open = amdgpu_debugfs_autodump_open, + .poll = amdgpu_debugfs_autodump_poll, + .release = amdgpu_debugfs_autodump_release, +}; + +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev) +{ + init_completion(>autodump.dumping); + init_waitqueue_head(>autodump.gpu_hang); + adev->autodump.app_listening = false; + + debugfs_create_file("amdgpu_autodump", 0600, + adev->ddev->primary->debugfs_root, + adev, _debug_fops); +} /** * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes @@ -1434,6 +1512,8 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev) amdgpu_ras_debugfs_create_all(adev); + amdgpu_debugfs_autodump_init(adev); + return