Re: [PATCH 2/2] [v3] drm/nouveau: expose GSP-RM logging buffers via debugfs
On 2/13/24 18:10, Timur Tabi wrote:
> On Tue, 2024-02-13 at 17:57 +0100, Danilo Krummrich wrote:
>>> +	struct debugfs_blob_wrapper blob_init;
>>> +	struct debugfs_blob_wrapper blob_intr;
>>> +	struct debugfs_blob_wrapper blob_rm;
>>> +	struct debugfs_blob_wrapper blob_pmu;
>>> +	struct dentry *debugfs_logging_dir;
>>
>> I think we should not create those from within the nvkm layer, but rather
>> pass them down through nvkm_device_pci_new().
>
> Should they be created in nvkm_device_pci_new() also, even though we have
> no idea whether GSP is involved at that point?

We can pass some structure to nvkm_device_pci_new() where GSP sets the
pointers to the buffers and maybe a callback to clean them up. Once
nvkm_device_pci_new() returns we'd see if something has been set or not. If
so, we can go ahead and create the debugfs nodes.

>> Lifecycle wise I think we should ensure that removing the Nouveau kernel
>> module also cleans up those buffers. Even though keep-gsp-logging is
>> considered unsafe, we shouldn't leak memory.
>
> I agree, but then there needs to be some way to keep these debugfs entries
> until the driver unloads. I don't know how to do that without creating
> some ugly global variables.

I fear that might be the only option. However, I still prefer a global list
over a memory leak.

>> For instance, can we allocate corresponding buffers in the driver layer,
>> copy things over and keep those buffers until nouveau_drm_exit()? This
>> would also avoid exposing those DMA buffers via debugfs.
>
> The whole point behind this patch is to expose the buffers via debugfs.
> How else should they be exposed?

I think I only thought about the fail case where we could just copy over the
data from the DMA buffer to another buffer and expose that one instead.
However, that doesn't work if GSP loads successfully.

Hence, as mentioned above, we might just want to have some structure that we can add to the global list that contains the pointers to the DMA buffers and maybe a callback function to clean them up defined by the GSP code that we can simply call from nouveau_drm_exit().
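
For illustration only, here is a minimal user-space sketch of the idea above. The lower layer fills in a structure with buffer pointers and a cleanup callback, the driver layer checks after device creation whether anything was set, keeps the structure on a global list, and walks that list at exit. All names (log_keep, device_new, driver_probe, driver_exit) are hypothetical stand-ins, not Nouveau symbols. Plain heap memory stands in for the DMA buffers, and no debugfs calls are made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_LOGS 4 /* stand-ins for LOGINIT, LOGINTR, LOGRM, LOGPMU */

/* Hypothetical record handed down to the lower layer. */
struct log_keep {
	struct log_keep *next;
	void *buf[NUM_LOGS];
	size_t size[NUM_LOGS];
	/* Set by the lower ("GSP") layer only if it owns logging buffers. */
	void (*cleanup)(struct log_keep *keep);
};

static struct log_keep *keep_list; /* global list, walked at exit */
static int buffers_freed;          /* bookkeeping for the sketch only */

/* Cleanup callback defined by the lower layer. */
static void gsp_log_cleanup(struct log_keep *keep)
{
	for (size_t i = 0; i < NUM_LOGS; i++) {
		if (keep->buf[i]) {
			free(keep->buf[i]);
			keep->buf[i] = NULL;
			buffers_freed++;
		}
	}
}

/* Stand-in for device creation; fills *keep only when GSP is in use. */
static int device_new(struct log_keep *keep, int has_gsp)
{
	assert(keep != NULL);
	if (!has_gsp)
		return 0;
	for (size_t i = 0; i < NUM_LOGS; i++) {
		keep->buf[i] = calloc(1, 4096);
		if (!keep->buf[i]) {
			gsp_log_cleanup(keep);
			return -1;
		}
		keep->size[i] = 4096;
	}
	keep->cleanup = gsp_log_cleanup;
	return 0;
}

/* Driver layer: create the device, then see whether anything was set. */
static int driver_probe(int has_gsp)
{
	struct log_keep *keep = calloc(1, sizeof(*keep));

	if (!keep)
		return -1;
	if (device_new(keep, has_gsp) < 0 || !keep->cleanup) {
		int failed = keep->cleanup == NULL && has_gsp;
		free(keep);
		return failed ? -1 : 0;
	}
	/* The debugfs nodes would be created here. */
	keep->next = keep_list;
	keep_list = keep;
	return 0;
}

/* Stand-in for module exit: release everything still on the list. */
static void driver_exit(void)
{
	while (keep_list) {
		struct log_keep *keep = keep_list;

		keep_list = keep->next;
		keep->cleanup(keep);
		free(keep);
	}
}
```

In the kernel this would presumably use a struct list_head protected by a lock, and the debugfs nodes would have to be removed before the cleanup callback frees the buffers. The sketch leaves both out.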
Re: [PATCH 2/2] [v3] drm/nouveau: expose GSP-RM logging buffers via debugfs
On Tue, 2024-02-13 at 17:57 +0100, Danilo Krummrich wrote:
>> +	struct debugfs_blob_wrapper blob_init;
>> +	struct debugfs_blob_wrapper blob_intr;
>> +	struct debugfs_blob_wrapper blob_rm;
>> +	struct debugfs_blob_wrapper blob_pmu;
>> +	struct dentry *debugfs_logging_dir;
>
> I think we should not create those from within the nvkm layer, but rather
> pass them down through nvkm_device_pci_new().

Should they be created in nvkm_device_pci_new() also, even though we have no
idea whether GSP is involved at that point?

> Lifecycle wise I think we should ensure that removing the Nouveau kernel
> module also cleans up those buffers. Even though keep-gsp-logging is
> considered unsafe, we shouldn't leak memory.

I agree, but then there needs to be some way to keep these debugfs entries
until the driver unloads. I don't know how to do that without creating some
ugly global variables.

> For instance, can we allocate corresponding buffers in the driver layer,
> copy things over and keep those buffers until nouveau_drm_exit()? This
> would also avoid exposing those DMA buffers via debugfs.

The whole point behind this patch is to expose the buffers via debugfs. How
else should they be exposed?
Re: [PATCH 2/2] [v3] drm/nouveau: expose GSP-RM logging buffers via debugfs
On 2/12/24 22:15, Timur Tabi wrote:
> The LOGINIT, LOGINTR, LOGRM, and LOGPMU buffers are circular buffers
> that have printf-like logs from GSP-RM and PMU encoded in them.
>
> LOGINIT, LOGINTR, and LOGRM are allocated by Nouveau and their DMA
> addresses are passed to GSP-RM during initialization. The buffers are
> required for GSP-RM to initialize properly.
>
> LOGPMU is also allocated by Nouveau, but its contents are updated
> when Nouveau receives an NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPC from
> GSP-RM. Nouveau then copies the RPC to the buffer.
>
> The messages are encoded as an array of variable-length structures that
> contain the parameters to an NV_PRINTF call. The format string and
> parameter count are stored in a special ELF image that contains only
> logging strings. This image is not currently shipped with the Nvidia
> driver.
>
> There are two methods to extract the logs.
>
> OpenRM tries to load the logging ELF, and if present, parses the log
> buffers in real time and outputs the strings to the kernel console.
>
> Alternatively, and this is the method used by this patch, the buffers
> can be exposed to user space, and a user-space tool (along with the
> logging ELF image) can parse the buffer and dump the logs.
>
> This method has the advantage that it allows the buffers to be parsed
> even when the logging ELF file is not available to the user. However,
> it has the disadvantage that the debugfs entries need to remain until
> the driver is unloaded.
>
> The buffers are exposed via debugfs. The debugfs entries must be
> created before GSP-RM is started, to ensure that they are available
> during GSP-RM initialization.
>
> If GSP-RM fails to initialize, then Nouveau immediately shuts down
> the GSP interface. This would normally also deallocate the logging
> buffers, thereby preventing the user from capturing the debug logs.
> To avoid this, the keep-gsp-logging command line parameter can be
> specified. This parameter is marked as *unsafe* (thereby tainting the
> kernel) because the DMA buffer and debugfs entries are never
> deallocated, even if the driver unloads. This gives the user the
> time to capture the logs, but it also means that resources can only
> be recovered by a reboot.
>
> An end-user can capture the logs using the following commands:
>
>     cp /sys/kernel/debug/dri//loginit loginit
>     cp /sys/kernel/debug/dri//logrm logrm
>     cp /sys/kernel/debug/dri//logintr logintr
>     cp /sys/kernel/debug/dri//logpmu logpmu
>
> where is the PCI ID of the GPU (e.g. :65:00.0). If
> keep-gsp-logging is specified, then the is the same but with
> -debug appended (e.g. :65:00.0-debug).
>
> Since LOGPMU is not needed for normal GSP-RM operation, it is only
> created if debugfs is available. Otherwise, the
> NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPCs are ignored.
>
> Signed-off-by: Timur Tabi
> ---
> v3: reworked r535_gsp_libos_debugfs_init, rebased for drm-next
>
>  .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h | 12 +
>  .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c| 215 +-
>  2 files changed, 223 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> index 3fbc57b16a05..2ee44bdf8be7 100644
> --- a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> +++ b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
> @@ -5,6 +5,8 @@
>  #include
>  #include
> +#include
> +
>  #define GSP_PAGE_SHIFT 12
>  #define GSP_PAGE_SIZE BIT(GSP_PAGE_SHIFT)
> @@ -217,6 +219,16 @@ struct nvkm_gsp {
>  	/* The size of the registry RPC */
>  	size_t registry_rpc_size;
> +
> +	/*
> +	 * Logging buffers in debugfs. The wrapper objects need to remain
> +	 * in memory until the dentry is deleted.
> +	 */
> +	struct debugfs_blob_wrapper blob_init;
> +	struct debugfs_blob_wrapper blob_intr;
> +	struct debugfs_blob_wrapper blob_rm;
> +	struct debugfs_blob_wrapper blob_pmu;
> +	struct dentry *debugfs_logging_dir;

I think we should not create those from within the nvkm layer, but rather
pass them down through nvkm_device_pci_new().

Lifecycle wise I think we should ensure that removing the Nouveau kernel
module also cleans up those buffers. Even though keep-gsp-logging is
considered unsafe, we shouldn't leak memory.

For instance, can we allocate corresponding buffers in the driver layer,
copy things over and keep those buffers until nouveau_drm_exit()? This would
also avoid exposing those DMA buffers via debugfs.

>  };
>  static inline bool
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> index 86b62c7e1229..56209bf81360 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> @@ -26,6 +26,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
> @@ -1979,6 +1980,196 @@ r535_gsp_rmargs_init(struct nvkm_gsp *gsp, bool resume)
>  	return 0;
>  }
> +#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722
> +
> +/**
> + * r535_gsp_msg_libos
[PATCH 2/2] [v3] drm/nouveau: expose GSP-RM logging buffers via debugfs
The LOGINIT, LOGINTR, LOGRM, and LOGPMU buffers are circular buffers
that have printf-like logs from GSP-RM and PMU encoded in them.

LOGINIT, LOGINTR, and LOGRM are allocated by Nouveau and their DMA
addresses are passed to GSP-RM during initialization. The buffers are
required for GSP-RM to initialize properly.

LOGPMU is also allocated by Nouveau, but its contents are updated
when Nouveau receives an NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPC from
GSP-RM. Nouveau then copies the RPC to the buffer.

The messages are encoded as an array of variable-length structures that
contain the parameters to an NV_PRINTF call. The format string and
parameter count are stored in a special ELF image that contains only
logging strings. This image is not currently shipped with the Nvidia
driver.

There are two methods to extract the logs.

OpenRM tries to load the logging ELF, and if present, parses the log
buffers in real time and outputs the strings to the kernel console.

Alternatively, and this is the method used by this patch, the buffers
can be exposed to user space, and a user-space tool (along with the
logging ELF image) can parse the buffer and dump the logs.

This method has the advantage that it allows the buffers to be parsed
even when the logging ELF file is not available to the user. However,
it has the disadvantage that the debugfs entries need to remain until
the driver is unloaded.

The buffers are exposed via debugfs. The debugfs entries must be
created before GSP-RM is started, to ensure that they are available
during GSP-RM initialization.

If GSP-RM fails to initialize, then Nouveau immediately shuts down
the GSP interface. This would normally also deallocate the logging
buffers, thereby preventing the user from capturing the debug logs.
To avoid this, the keep-gsp-logging command line parameter can be
specified. This parameter is marked as *unsafe* (thereby tainting the
kernel) because the DMA buffer and debugfs entries are never
deallocated, even if the driver unloads. This gives the user the
time to capture the logs, but it also means that resources can only
be recovered by a reboot.

An end-user can capture the logs using the following commands:

    cp /sys/kernel/debug/dri//loginit loginit
    cp /sys/kernel/debug/dri//logrm logrm
    cp /sys/kernel/debug/dri//logintr logintr
    cp /sys/kernel/debug/dri//logpmu logpmu

where is the PCI ID of the GPU (e.g. :65:00.0). If
keep-gsp-logging is specified, then the is the same but with
-debug appended (e.g. :65:00.0-debug).

Since LOGPMU is not needed for normal GSP-RM operation, it is only
created if debugfs is available. Otherwise, the
NV_VGPU_MSG_EVENT_UCODE_LIBOS_PRINT RPCs are ignored.

Signed-off-by: Timur Tabi
---
v3: reworked r535_gsp_libos_debugfs_init, rebased for drm-next

 .../gpu/drm/nouveau/include/nvkm/subdev/gsp.h | 12 +
 .../gpu/drm/nouveau/nvkm/subdev/gsp/r535.c| 215 +-
 2 files changed, 223 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
index 3fbc57b16a05..2ee44bdf8be7 100644
--- a/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
+++ b/drivers/gpu/drm/nouveau/include/nvkm/subdev/gsp.h
@@ -5,6 +5,8 @@
 #include
 #include
+#include
+
 #define GSP_PAGE_SHIFT 12
 #define GSP_PAGE_SIZE BIT(GSP_PAGE_SHIFT)
@@ -217,6 +219,16 @@ struct nvkm_gsp {
 	/* The size of the registry RPC */
 	size_t registry_rpc_size;
+
+	/*
+	 * Logging buffers in debugfs. The wrapper objects need to remain
+	 * in memory until the dentry is deleted.
+	 */
+	struct debugfs_blob_wrapper blob_init;
+	struct debugfs_blob_wrapper blob_intr;
+	struct debugfs_blob_wrapper blob_rm;
+	struct debugfs_blob_wrapper blob_pmu;
+	struct dentry *debugfs_logging_dir;
 };
 static inline bool
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
index 86b62c7e1229..56209bf81360 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 #include
@@ -1979,6 +1980,196 @@ r535_gsp_rmargs_init(struct nvkm_gsp *gsp, bool resume)
 	return 0;
 }
+#define NV_GSP_MSG_EVENT_UCODE_LIBOS_CLASS_PMU 0xf3d722
+
+/**
+ * r535_gsp_msg_libos_print - capture log message from the PMU
+ * @priv: gsp pointer
+ * @fn: function number (ignored)
+ * @repv: pointer to libos print RPC
+ * @repc: message size
+ *
+ * See _kgspRpcUcodeLibosPrint
+ */
+static int r535_gsp_msg_libos_print(void *priv, u32 fn, void *repv, u32 repc)
+{
+	struct nvkm_gsp *gsp = priv;
+	struct nvkm_subdev *subdev = &gsp->subdev;
+	struct {
+		u32 ucodeEngDesc;
+		u32 libosPrintBufSize;
+		u8 libosPrintBuf[];
+	} *rpc = repv;
+	unsigned int class = rpc->ucodeEngDesc >