Re: Optimize VM handling a bit more

2018-09-10 Thread Zhang, Jerry (Junwei)

Apart from Felix's comments,

Looks good to me, patches 2 ~ 8 are
Reviewed-by: Junwei Zhang 

Patches 9 ~ 11 are
Acked-by: Junwei Zhang 


On 09/10/2018 02:03 AM, Christian König wrote:

Hi everyone,

Especially on Vega and Raven, VM handling is rather inefficient while creating 
PTEs because we originally only supported 2-level page tables and implemented 
4-level page tables on top of that.

This patch set reworks quite a bit of that handling and adds proper iterator 
and tree walking functions which are then used to update PTEs more efficiently.

A completely artificial test case which maps 2GB of VRAM at an unaligned 
address is reduced from 45ms down to ~20ms on my test system.

As a very positive side effect this also adds support for 1GB giant VRAM pages 
in addition to the existing 2MB huge pages on Vega/Raven, and enables all 
power-of-two sizes in between (2MB-2GB) for the L1.

This could be beneficial for applications which allocate very large amounts of 
memory, because it reduces the overhead of page table walks by 50% (huge pages 
were 25%).

Please comment and/or review,
Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




Re: Optimize VM handling a bit more

2018-09-10 Thread Felix Kuehling
Patches 2, 3, 5, 6, 8, 9, 11 are Reviewed-by: Felix Kuehling


I replied with comments to 1, 4, 7, 10.

On another thread, some of the machine learning guys found that the main
overhead of our memory allocator is the clearing of BOs. I'm thinking about
a way to avoid that, but your patch 1 interferes with that.

My idea is to cache vram_page_split-sized drm_mm_nodes in
amdgpu_vram_mgr per process, instead of just freeing them. When the same
process next allocates memory, first try to reuse a node that it freed
earlier. This would work for the common case where there are no special
alignment or placement restrictions. Having most nodes of the same size
(typically 2MB) helps and makes the lookup of existing nodes very fast.
Having to deal with different node sizes would make it more difficult.
Also, the cache would likely interfere with attempts to get large nodes
in the first place.
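A minimal userspace sketch of that idea, for illustration only: the names
node, node_cache, cache_get and cache_put are invented here and do not
exist in amdgpu, and a real implementation would bound the list and drain
it properly when the owning process changes.

```c
#include <stddef.h>

/* Hypothetical per-process cache of freed fixed-size allocator nodes. */

struct node {
    unsigned long start;     /* first page of the block */
    unsigned long size;      /* block size in pages, typically 2MB worth */
    struct node *next;
};

struct node_cache {
    int pid;                 /* process the cached nodes belonged to */
    struct node *free_list;  /* nodes that process has freed */
};

/* Reuse a node previously freed by @pid, or return NULL on a miss
 * (different process, or nothing cached). */
static struct node *cache_get(struct node_cache *c, int pid)
{
    struct node *n;

    if (c->pid != pid || !c->free_list)
        return NULL;
    n = c->free_list;
    c->free_list = n->next;
    return n;
}

/* Instead of freeing @n, keep it for the next allocation by @pid. */
static void cache_put(struct node_cache *c, int pid, struct node *n)
{
    if (c->pid != pid) {
        c->pid = pid;
        c->free_list = NULL;  /* sketch: simply forget the old owner's nodes */
    }
    n->next = c->free_list;
    c->free_list = n;
}
```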

I started some code, but I'm not sure I'll be able to send out something
working for review before my vacation at the end of this week, and then XDC.

Regards,
  Felix


On 2018-09-09 02:03 PM, Christian König wrote:
> Hi everyone,
>
> Especially on Vega and Raven, VM handling is rather inefficient while creating 
> PTEs because we originally only supported 2-level page tables and implemented 
> 4-level page tables on top of that.
>
> This patch set reworks quite a bit of that handling and adds proper iterator 
> and tree walking functions which are then used to update PTEs more 
> efficiently.
>
> A completely artificial test case which maps 2GB of VRAM at an 
> unaligned address is reduced from 45ms down to ~20ms on my test system.
>
> As a very positive side effect this also adds support for 1GB giant VRAM 
> pages in addition to the existing 2MB huge pages on Vega/Raven, and 
> enables all power-of-two sizes in between (2MB-2GB) for the L1.
>
> This could be beneficial for applications which allocate very large amounts of 
> memory, because it reduces the overhead of page table walks by 50% (huge pages 
> were 25%).
>
> Please comment and/or review,
> Christian.
>



Optimize VM handling a bit more

2018-09-09 Thread Christian König
Hi everyone,

Especially on Vega and Raven, VM handling is rather inefficient while creating 
PTEs because we originally only supported 2-level page tables and implemented 
4-level page tables on top of that.

This patch set reworks quite a bit of that handling and adds proper iterator 
and tree walking functions which are then used to update PTEs more efficiently.
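For illustration, the index arithmetic such an iterator needs can be
sketched in userspace. The layout below (4KB base pages, 9 address bits
per level, 4 levels) is an assumption for the example, not the exact
amdgpu configuration, and pt_index/pt_next are invented names:

```c
#include <stdint.h>

#define EX_PAGE_SHIFT   12
#define EX_BITS_PER_LVL 9
#define EX_NUM_LEVELS   4

/* Shift for @level: level 0 is the root directory,
 * EX_NUM_LEVELS - 1 holds the leaf PTEs. */
static int ex_level_shift(int level)
{
    return EX_PAGE_SHIFT + EX_BITS_PER_LVL * (EX_NUM_LEVELS - 1 - level);
}

/* Entry index for @addr within the directory at @level. */
static unsigned int pt_index(uint64_t addr, int level)
{
    return (addr >> ex_level_shift(level)) & ((1u << EX_BITS_PER_LVL) - 1);
}

/* First address of the next entry at @level: where an iterator resumes
 * after it has finished the subtree below the current entry. */
static uint64_t pt_next(uint64_t addr, int level)
{
    uint64_t entry_size = 1ull << ex_level_shift(level);

    return (addr | (entry_size - 1)) + 1;
}
```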

A completely artificial test case which maps 2GB of VRAM at an unaligned 
address is reduced from 45ms down to ~20ms on my test system.

As a very positive side effect this also adds support for 1GB giant VRAM pages 
in addition to the existing 2MB huge pages on Vega/Raven, and enables all 
power-of-two sizes in between (2MB-2GB) for the L1.
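The size selection this enables can be sketched as picking the largest
power of two that both divides the start address and still fits in the
remaining size; max_block below is a hypothetical helper for illustration,
not the driver's actual code:

```c
#include <stdint.h>

/* Largest power-of-two block usable for a mapping starting at @start
 * with @size bytes remaining (size assumed non-zero). */
static uint64_t max_block(uint64_t start, uint64_t size)
{
    /* The lowest set bit of the address limits the alignment ... */
    uint64_t align = start ? (start & -start) : UINT64_MAX;
    uint64_t fit = 1;

    /* ... and the remaining size limits how big the block may be. */
    while (fit * 2 <= size)
        fit *= 2;

    return align < fit ? align : fit;
}
```

For example, a 1GB mapping starting on a 2MB boundary can at best use 2MB
blocks until the address reaches a larger alignment.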

This could be beneficial for applications which allocate very large amounts of 
memory, because it reduces the overhead of page table walks by 50% (huge pages 
were 25%).
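One plausible reading of those percentages (my assumption, not spelled out
above): with a 4-level walk, a translation that terminates at a 2MB huge
page skips one of the four lookups, and one that terminates at a 1GB giant
page skips two.

```c
/* Fraction of a @total_levels-deep walk saved when the translation
 * resolves @levels_skipped levels above the leaf. */
static int walk_savings_pct(int total_levels, int levels_skipped)
{
    return levels_skipped * 100 / total_levels;
}
```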

Please comment and/or review,
Christian.
