Hi Christian,

Missing kunmap mapping in vmalloc will make kernel master page table incorrect. 
I would not call such issue as completely harmless. Please note that AMD 
graphic driver can run in 32 bit system. In 32 bit system, vmalloc address 
space is much smaller than size of most GPU VRAM.


amdgpu_bo_free_kernel has same issue as amdgpu_vram_scratch_fini. 1. It calls 
amdgpu_bo_reserve interruptible too. 2. It misses kunmap when amdgpu_bo_reserve 
returns error too. As result, kernel master page table can become incorrect, or 
as you call it "completely harmless vmalloc space leaking".


Because amdgpu_bo_free_kernel is used in more places, such as psp command 
submission, there will be bigger chance to have other usage where signal is not 
blocked. This will become a real bug.


I am thinking that we may fix the issue completely when TTM releases BO. But 
that is a bigger change.


It is good that you agree that there is no real world bad example caused by my 
patch. I will not discuss whether it is an improvement or not now to save time 
for both of us.


Thanks,

Alex Bin Xie


________________________________
From: Christian König <deathsim...@vodafone.de>
Sent: Wednesday, April 19, 2017 7:50 AM
To: Xie, AlexBin; Zhou, David(ChunMing); amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Without correctly kunmap, page table is corrupted. Page entries point to wrong 
memory locations. You might call it completely harmless. But I think this is a 
severe bug. Leaking memory is better than a corrupted page table. Think 
security.
We are talking about the page tables for the vmalloc area in the kernel here, 
so no security problem. Leaking memory is much more problematic.


Would you provide any document and reference by saying" It is impossible to 
receive a signal during module load/unload"? For example, if the unload stuck 
in a lock, can CTRL+C stop the unload?

No, CTRL+C doesn't abort module load/unload. There have been patches to changes 
this a while ago, but IIRC it broke a whole bunch of driver relying on this.


What about there is some other return error? What about in future somebody 
improve amdgpu_bo_reserve to return other errors, then function 
amdgpu_vram_scratch_fini becomes buggy?

Yes, that is indeed an issue. For example -EDEADLK is possible as well. That's 
why I said we should use amdgpu_bo_free_kernel() instead.


While I am thinking whether there is a better way for the current situation, 
would you give a real world example that my patch really not working? Then we 
can address it.

I don't think there is because the driver can't receive a signal during 
load/unload, but the problem is rather that the patch doesn't improve the 
situation at all.

Regards,
Christian.

Am 19.04.2017 um 13:37 schrieb Xie, AlexBin:

Hi Christian,


Without correctly kunmap, page table is corrupted. Page entries point to wrong 
memory locations. You might call it completely harmless. But I think this is a 
severe bug. Leaking memory is better than a corrupted page table. Think 
security.


Would you provide any document and reference by saying" It is impossible to 
receive a signal during module load/unload"? For example, if the unload stuck 
in a lock, can CTRL+C stop the unload?


If "It is impossible to receive a signal during module load/unload", 
interruptible waiting is fine too, because function amdgpu_bo_reserve will 
return successfully.


What about there is some other return error? What about in future somebody 
improve amdgpu_bo_reserve to return other errors, then function 
amdgpu_vram_scratch_fini becomes buggy?


While I am thinking whether there is a better way for the current situation, 
would you give a real world example that my patch really not working? Then we 
can address it.


Thanks,

Alex Bin

________________________________
From: Christian König <deathsim...@vodafone.de><mailto:deathsim...@vodafone.de>
Sent: Wednesday, April 19, 2017 2:35 AM
To: Xie, AlexBin; Zhou, David(ChunMing); 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

the answer is ttm_bo_kunmap isn't called at all and yes in the case of an iomap 
we leak the address space reserved.

But that is completely harmless on a 64bit system compared to leaking the 
memory backing the address space.

Using amdgpu_bo_free_kernel() instead of openly coding it here is probably a 
good idea.

Additional to that it's probably a good idea to set the no_intr flag when 
reserving kernel BOs. It is impossible to receive a signal during module 
load/unload, but it's probably better to document that in the code as well.

Regards,
Christian.

Am 18.04.2017 um 20:54 schrieb Xie, AlexBin:
Hi Christian,

Have you found how/where/when? When you said "mapping will just be released a 
bit later on", you must know the answer.

It is difficult to prove something does not exist. Anyway, I will give it a try 
to prove such "later on" does not exist.

Function ttm_bo_kunmap is the only function to unmap. To prove this, search 
ttm_bo_map_iomap, only ttm_bo_kunmap use this enum to correctly kunmap.

Function ttm_bo_kunmap is not called by ttm itself. This is a hint that all TTM 
delay delete mechanism or unref mechanism will NOT kunmap BO later on.

Function ttm_bo_kunmap is called by AMDGPU function amdgpu_bo_kunmap and 
amdgpu_gem_prime_vunmap.

Search AMDGPU for amdgpu_bo_kunmap. All matches do not kunmap for scratch VRAM 
BO.  amdgpu_bo_free_kernel is a suspect but the answer is still NO.

So all possibilities are searched. Did I miss anything?

Thanks,
Alex Bin Xie

________________________________
From: Xie, AlexBin
Sent: Tuesday, April 18, 2017 2:04:33 PM
To: Christian König; Zhou, David(ChunMing); 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO


Hi Christian,


Would you point out where/when will kunmap happen for this BO when release? It 
must be somewhere in some function calls.


I checked before I asked for review. But I did not see such obvious kunmap 
function call.


If so, there should be a comment in function amdgpu_vram_scratch_fini to avoid 
future confusion.

Thanks,
Alex Bin Xie
________________________________
From: Christian König <deathsim...@vodafone.de><mailto:deathsim...@vodafone.de>
Sent: Tuesday, April 18, 2017 1:46 PM
To: Xie, AlexBin; Zhou, David(ChunMing); 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO

Hi AlexBin,

No, David is right. This is a very common coding pattern in the kernel module.

Freeing up a BO while there still exists a kernel mapping is perfectly ok, the 
mapping will just be released a bit later on.

So this code is actually perfectly ok and just an optimization, but your patch 
breaks it and creates a memory leak.

Regards,
Christian.

Am 18.04.2017 um 17:17 schrieb Xie, AlexBin:

Hi David,


When amdgpu_bo_reserve return errors, we cannot release the BO. This is not a 
memory leak. General speaking, memory leak is unnoticed and unintentional.


The caller of function amdgpu_vram_scratch_fini ignores the return error 
value...


The "memory leak" is not caused by my patch. It is caused because reserving BO 
fails.


This patch only aim to make function amdgpu_vram_scratch_fini behave correctly.


To follow up, we can add a warning message when amdgpu_bo_reserve error happens 
in a different patch.


If function call amdgpu_bo_reserve is changed to uninterruptible, this changes 
driver behaviour. Without a substantial issue, I would be cautious for such a 
change.


Thanks,

Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Monday, April 17, 2017 10:38 PM
To: Xie, AlexBin; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月17日 22:54, Xie, AlexBin wrote:

Hi David,


Thanks for the comments. However, please have look at amdgpu_bo_reserve 
definition.

static inline int amdgpu_bo_reserve(struct amdgpu_bo *bo, bool no_intr)

Ah, this is a wired wrapper for ttm_bo_reserve.



When we call this function like the following:

         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
The false means interruptible.


On the other hand,  when amdgpu_bo_reserve function return error, why do we 
unref BO without kunmap and unpin the BO? This is wrong implementation when 
amdgpu_bo_reserve return any error.

Yeah, I see your mean, it's in driver un-loading, How about changing to no 
interruptible? Your patch will make a memleak if bo_reserve fails, but it seems 
not matter. I have no strong preference.

Regards,
David Zhou


Thanks,
Alex Bin Xie

________________________________
From: Zhou, David(ChunMing)
Sent: Friday, April 14, 2017 12:00 AM
To: Xie, AlexBin; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] dmr/amdgpu: Fix wrongly unref of BO



On 2017年04月14日 05:34, Alex Xie wrote:
> According to comment of amdgpu_bo_reserve, amdgpu_bo_reserve
> can return with -ERESTARTSYS. When this function was interrupted
> by a signal, BO should not be unref. Otherwise the BO might be
> released while is kmapped and pinned, or BO MIGHT be deref
> multiple times, etc.
         r = amdgpu_bo_reserve(adev->vram_scratch.robj, false);
we have specified interruptible to false, so -ERESTARTSYS isn't possible
here.

Thanks,
David Zhou
>
> Change-Id: If76071a768950a0d3ad9d5da7fcae04881807621
> Signed-off-by: Alex Xie <alexbin....@amd.com><mailto:alexbin....@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 53996e3..1dcc2d1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -355,8 +355,8 @@ static void amdgpu_vram_scratch_fini(struct amdgpu_device 
> *adev)
>                amdgpu_bo_kunmap(adev->vram_scratch.robj);
>                amdgpu_bo_unpin(adev->vram_scratch.robj);
>                amdgpu_bo_unreserve(adev->vram_scratch.robj);
> +             amdgpu_bo_unref(&adev->vram_scratch.robj);
>        }
> -     amdgpu_bo_unref(&adev->vram_scratch.robj);
>   }
>
>   /**





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx





_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Reply via email to