Re: [RFC PATCH 8/9] drm/gem: Associate GEM objects with drm cgroup

2021-05-10 Thread Tamminen, Eero T
Hi,

On Mon, 2021-05-10 at 17:36 +0200, Daniel Vetter wrote:
> 
...
> > If DRM allows user-space to exhaust all of system memory, this seems
> > to be a gap in enforcement of MEMCG limits for system memory.
> > I tried to look into it when this was discussed in the past.
> > My guess is that shmem_read_mapping_page_gfp() ->
> > shmem_getpage_gfp() is not choosing the correct MM to charge against
> > in the use case of drivers using shmemfs for backing gem buffers.
> 
> Yeah we know about this one since forever. The bug report is roughly
> as old as the gem/ttm memory managers :-/ So another problem might be
> that if we now suddenly include gpu memory in the memcg accounting, we
> start breaking a bunch of workloads that worked just fine beforehand.

It's not the first time that tightening security has required adapting
the settings of running workloads...

A workload's GPU memory usage needs to be a significant portion of the
application's real memory usage to make the workload hit limits that
were set for it earlier.  Therefore I think it is definitely something
that a user setting such limits actually cares about.

=> I think the important thing is that the reason for the failures is
clear from the OOM message.  This works much better if GPU-related
memory usage is stated specifically in that message, once that memory
starts to be accounted as system memory.


- Eero



Re: [Intel-gfx] [PATCH] drm/i915/pmu: Check actual RC6 status

2021-04-01 Thread Tamminen, Eero T
Hi,

On Thu, 2021-04-01 at 05:54 -0400, Rodrigo Vivi wrote:
> On Thu, Apr 01, 2021 at 10:38:11AM +0100, Tvrtko Ursulin wrote:
...
> > I think it is possible to argue both ways.
> > 
> > 1)
> > HAS_RC6 means hardware has RC6 so if we view PMU as very low level we
> > can say always export it.
> > 
> > If i915 had to turn it off (rc6->supported == false) due to firmware
> > or GVT-g, then we could say reporting zero RC6 is accurate in that
> > sense. Only the reason "why it is zero" is missing for PMU users.
> > 
> > 2)
> > Or if we go with this patch we could say that presence of the PMU
> > metric means RC6 is active and enabled, while absence means it is
> > either not supported due to platform (or firmware) or how the
> > platform is getting used (GVT-g).
> > 
> 
> yeap, these 2 cases described well my mental conflict...
> 
> > So I think patch is a bit better. I don't see it is adding more
> > confusion.
> 
> As I said on the other patch I have no strong position on which is
> better, but if you and Eero feel that this works better for the
> current case, let's do it...

IMHO seeing case 1), i.e. zero RC6, could be slightly better from the
user's point of view than not seeing RC6 at all, because:

A) the user then knows that the GPU is not entering RC6, and

B) the question then is why it's not entering RC6 => one can see from
sysfs that RC6 has been disabled


Whereas in case 2), the question is why there's no RC6 info, and the
user doesn't know whether the GPU is suspended or not (i.e. why GPU
power consumption is higher than expected).  It would help if i-g-t
could show e.g. "RC6 OFF" in that case.
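A minimal sketch, in C, of how a tool could pick that label. It assumes the tool already knows whether the i915 PMU exposes an RC6 counter, and that the RC6 enable mask can be read from the card's sysfs power/rc6_enable attribute, printed in hex (e.g. /sys/class/drm/card0/power/rc6_enable; treat the exact path as an assumption). Both helper names are hypothetical, not existing i-g-t functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not an existing i-g-t function: read the RC6
 * enable mask from a sysfs attribute such as power/rc6_enable, which is
 * assumed to contain a hex value. Returns the mask, or -1 if the file is
 * missing or cannot be parsed.
 */
static int read_rc6_enable(const char *path)
{
	FILE *f;
	unsigned int mask;
	int ret = -1;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* attribute missing or unreadable */
	if (fscanf(f, "%x", &mask) == 1)
		ret = (int)mask;
	fclose(f);
	return ret;
}

/*
 * Hypothetical helper: choose what to display for RC6.
 * pmu_has_rc6: whether the PMU exposes an RC6 residency counter
 * rc6_enable:  value from read_rc6_enable(), -1 if unknown
 */
static const char *rc6_label(bool pmu_has_rc6, int rc6_enable)
{
	if (pmu_has_rc6)
		return "RC6";		/* show residency from the PMU counter */
	if (rc6_enable == 0)
		return "RC6 OFF";	/* counter absent because RC6 is disabled */
	return "RC6 N/A";		/* counter absent, reason unknown */
}
```

With something like this, case 2) no longer looks like a silently missing metric: the absent counter is explained by the sysfs state whenever that state is readable.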


- Eero

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel