> The following files are added:
> 
> - expansion_limit - the maximum size (in bytes) to which this cpuset's
> memory set can be expanded
> 
> - expansion_pressure - the abstract memory pressure for tasks within
> the cpuset, ranging from 0 (no pressure) to 100 (about to go OOM) at
> which expansion can occur
> 
> - unused_mems (read-only) - the set of memory nodes that are available
> to this cpuset and not assigned to the mems set of any of this
> cpuset's children; automatically maintained by the system
> 
> - expansion_mems - a vector of nodelists that determine which nodes
> should be considered as potential expansion nodes, if available, in
> priority order


I've been struggling with a couple of variations on this theme.

I have one site that needs to have jobs that start swapping get killed.

I have another site that needs to have jobs that are about to start
kicking in the OOM killer instead get more memory - though they don't
need to run nearly as tight or carefully metered a limiter on how much
extra memory the job can get as your expansion mechanism allows for,
and they don't want to set aside unused nodes to divvy out for these
cases, but rather they want to allow the tasks to start taking from
the nodes allowed to other jobs (yes, sounds odd, but I think that's
a fair statement.)

In both cases, these are very important adaptations to the particular
memory needs and workloads of particular sites, and difficult to
accomplish with existing mechanisms.  And neither variation seems to
be easily answered with this patch proposal either.

This patch adds the single largest expansion of per-cpuset attributes
of any change we've proposed.  My sense is that it is a tad overly
specialized to a particular situation (granted, a popular situation.)

But it tries to address a significant cause of difficulty in using
cpusets, so I am most encouraged.

How about a loadable module?

Instead of calling out to cpuset code that can expand the job's cpuset
from a pre-defined pool, rather call out to a routine that can be
provided by a loadable module.  Keep the expansion_pressure, keep the
callout, but drop the rest and make the callout pluggable.

Then sites with specialized needs at the point of a particular amount
of pressure can provide specialized solutions.

I can imagine two more per-cpuset files, instead of the four above:

  memory_expansion_pressure - level, 0-100, at which the callout is called
  memory_expansion_routine - string name of a registered callout.

A registration routine would be available to the init routine of
loadable modules, letting them register a callback under a string name.
That name would be matched against 'memory_expansion_routine' when a
memory request was made in a cpuset whose pressure exceeded that
cpuset's specified memory_expansion_pressure.

This would make the API quite a bit more generic and simple, and meet a
greater variety of needs.

Don't invoke the callout if the task can't sleep at that point; coders
of such loadable modules are ill-prepared to deal with that case, and
would sooner let such memory requests be handled as they are now.
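
To make the shape of this concrete, here is a rough user-space sketch of
registration by name plus the can't-sleep rule.  Every identifier in it
(register_mem_expansion_callout, maybe_call_expansion, struct
cpuset_sketch, the callout signature, the fixed-size table) is made up
for illustration; none of it is existing kernel API, and I've assumed
the callout fires once pressure is at or above the configured level.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CALLOUTS 8	/* arbitrary table size for this sketch */

/* Hypothetical callout signature: returns 0 if it handled the pressure. */
typedef int (*mem_expansion_fn)(int pressure);

struct callout_entry {
	const char *name;
	mem_expansion_fn fn;
};

static struct callout_entry callouts[MAX_CALLOUTS];
static size_t nr_callouts;

/* Would be called from a module's init routine (hypothetical name). */
int register_mem_expansion_callout(const char *name, mem_expansion_fn fn)
{
	size_t i;

	if (!name || !fn || nr_callouts == MAX_CALLOUTS)
		return -1;
	for (i = 0; i < nr_callouts; i++)
		if (strcmp(callouts[i].name, name) == 0)
			return -1;	/* name already registered */
	callouts[nr_callouts].name = name;
	callouts[nr_callouts].fn = fn;
	nr_callouts++;
	return 0;
}

/* Stand-in for the two proposed per-cpuset files. */
struct cpuset_sketch {
	int memory_expansion_pressure;		/* 0-100, or -1 if unset */
	const char *memory_expansion_routine;	/* registered name, or NULL */
};

/* Returns the callout's result, or -1 if no callout was invoked. */
int maybe_call_expansion(const struct cpuset_sketch *cs, int pressure,
			 int can_sleep)
{
	size_t i;

	/* Never call out from a context that can't sleep. */
	if (!can_sleep)
		return -1;
	if (!cs->memory_expansion_routine || cs->memory_expansion_pressure < 0)
		return -1;
	if (pressure < cs->memory_expansion_pressure)
		return -1;
	for (i = 0; i < nr_callouts; i++)
		if (strcmp(callouts[i].name, cs->memory_expansion_routine) == 0)
			return callouts[i].fn(pressure);
	return -1;	/* no module registered under that name */
}

/* Example callout a module might provide; it only counts its calls. */
int example_calls;

int example_callout(int pressure)
{
	(void)pressure;
	example_calls++;
	return 0;
}
```

When no callout is invoked (can't sleep, nothing configured, pressure
too low, or no such name registered), the caller would carry on with
the memory request exactly as it is handled now.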

Right now, if I had to cast a final vote (there is no 'final' vote, and
I wouldn't have it if there was ;), I'd much prefer a loadable module
hook here, than this particular 'expansion_mems' mechanism.

I'm open to discussing changing the value reported by 'memory_pressure'
into being the unfiltered metric needed here, to consolidate these two
metrics.  Now that I have some more real world experience with this
memory_pressure value, it is proving to have worth about half way
between what I hoped it would have, providing a user accessible leading
indicator of heavy swapping, and the lesser worth that I'd guess Andrew
was predicting for it.  If 'memory_pressure' is low, it means we are not
swapping heavily.  But if it is high, then either we are swapping or we
are pushing dirty pages to the file system.  If I ended up with a
loadable module hook that could be called out at a specified pressure
level, that would be a huge improvement from my perspective, and having
just a single pressure metric exposed in the API is a worthy goal.  I'm
sure that Andrew would get a kick out of applying the patch to remove
the single-pole low-pass recursive (IIR) filter code in cpuset.c ;).
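
For anyone who hasn't met the term: a single-pole low-pass recursive
filter just blends each new sample into a running value, so the output
lags and smooths the input.  A generic sketch in integer arithmetic
follows; the coefficient, scale and function name are invented for
illustration, and this is not the code in cpuset.c.

```c
#define FILTER_SCALE 1000	/* fixed-point scale (assumed for illustration) */
#define FILTER_COEF   900	/* weight kept from the previous output (assumed) */

/* One step: y[n] = (COEF * y[n-1] + (SCALE - COEF) * x[n]) / SCALE */
long lowpass_step(long prev, long sample)
{
	return (FILTER_COEF * prev + (FILTER_SCALE - FILTER_COEF) * sample)
		/ FILTER_SCALE;
}
```

An "unfiltered" metric, in this picture, would simply report the raw
sample instead of the smoothed running value.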

For API compatibility, we should continue to have a per-cpuset metric
called memory_pressure, that tends to get bigger the greater the
memory distress, but it's negotiable just what the recipe is for
calculating that metric.  And it could become a value that is both
readable and writable, instead of just the read-only value it is now.

Obligatory nit - some places you have code such as:

+       /* if expansion isn't configured, don't expand */
+       if (cs->expansion_pressure < 0) return 0;
+       /* if memory pressure isn't high enough, don't expand */
+       if (pressure < cs->expansion_pressure) return 0;
+       /* if we're at the limit, don't expand */
+       if (cs->total_pages >= cs->expansion_limit) return 0;

The "if (...)" and the "return 0;" should be on separate lines.
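
Something like the following, wrapped in a made-up function so that it
stands alone.  The struct is a stand-in holding only the three fields
used in the quoted lines, and the final "return 1" is my assumption
about what follows those checks in the patch.

```c
/* Hypothetical stand-in for the fields the quoted checks touch. */
struct expansion_cfg {
	int expansion_pressure;
	unsigned long total_pages;
	unsigned long expansion_limit;
};

/* Returns 0 when expansion must not happen, 1 otherwise (assumed). */
int may_expand(const struct expansion_cfg *cs, int pressure)
{
	/* if expansion isn't configured, don't expand */
	if (cs->expansion_pressure < 0)
		return 0;
	/* if memory pressure isn't high enough, don't expand */
	if (pressure < cs->expansion_pressure)
		return 0;
	/* if we're at the limit, don't expand */
	if (cs->total_pages >= cs->expansion_limit)
		return 0;
	return 1;
}
```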

Back to the main line of thought.  The locking could get tricky.

Perhaps an analog of CAS (compare and swap) works.  Provide the
callout code with a routine it can invoke that states in effect:

  if my task's mems_allowed is This
    then change it to That
  else
    return failure (update This in place?)

Then we can invoke the callout routine not holding either of the
cpuset locks, manage_mutex or callback_mutex, and we can keep our
intricate details of cpuset locking the private business of mainline
kernel code, as they should be.
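
A user-space sketch of that compare-and-swap interface, using C11
atomics on a single-word mask as a stand-in.  The names
(mems_allowed_cmpxchg, struct task_sketch, nodemask_sketch_t) are
invented here, and a real version would presumably take the cpuset
locks inside the routine rather than rely on one atomic word, since a
real nodemask can be wider than that.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One bit per memory node; a simplification of a real nodemask. */
typedef unsigned long nodemask_sketch_t;

struct task_sketch {
	_Atomic nodemask_sketch_t mems_allowed;
};

/*
 * If tsk->mems_allowed equals *expected ("This"), change it to desired
 * ("That") and return true.  Otherwise leave it alone, store the value
 * actually seen into *expected, and return false.
 */
bool mems_allowed_cmpxchg(struct task_sketch *tsk,
			  nodemask_sketch_t *expected,
			  nodemask_sketch_t desired)
{
	return atomic_compare_exchange_strong(&tsk->mems_allowed,
					      expected, desired);
}
```

On failure the callout gets back the current mask in place of its stale
"This", so it can recompute "That" and retry, or give up.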

If we could arrange to invoke these callout routines not holding any
significant global lock, so that the callout routine could even go
so far as to invoke a separate thread to run user space code to muck
with cpusets all before returning, then that would be great.

By the way, I just happened to notice that the 0..100 pressure value
in this patch seems strangely like the 'distress' value in the
mm/vmscan.c:shrink_active_list() code.  But I'm no vm guru, so perhaps
this is a superficial similarity.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
