[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-31 Thread William Lee Irwin III
On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>> Its ->wait_runtime will drop less significantly, which lets it be
>> inserted in rb-tree much to the left of those 1000 tasks (and which
>> indirectly lets it gain back its fair share during subsequent
>> schedule cycles).
>> Hmm ..is that the theory?

On Thu, May 31, 2007 at 02:26:00PM +0530, Srivatsa Vaddagiri wrote:
> My only concern is the time needed to converge to this fair
> distribution, especially in face of fluctuating workloads. For ex: a
> container who does a fork bomb can have a very adverse impact on
> other container's fair share under this scheme compared to other
> schemes which dedicate separate rb-trees for differnet containers
> (and which also support two level hierarchical scheduling inside the
> core scheduler).
> I am inclined to have the core scheduler support atleast two levels
> of hierarchy (to better isolate each container) and resort to the
> flattening trick for higher levels.

Yes, the larger number of schedulable entities and hence slower
convergence to groupwise weightings is a disadvantage of the flattening.
A hybrid scheme seems reasonable enough. Ideally one would chop the
hierarchy in pieces so that n levels of hierarchy become k levels of n/k
weight-flattened hierarchies for this sort of attack to be most effective
(at least assuming similar branching factors at all levels of hierarchy
and sufficient depth to the hierarchy to make it meaningful) but this is
awkward to do. Peeling off the outermost container or whichever level is
deemed most important in terms of accuracy of aggregate enforcement as
a hierarchical scheduler is a practical compromise.

Hybrid schemes will still incur the difficulties of hierarchical
scheduling, but they're by no means insurmountable. Sadly, only
complete flattening yields the simplifications that make task group
weighting enforcement orthogonal to load balancing and the like. The
scheme I described for global nice number behavior is also not readily
adaptable to hybrid schemes.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-31 Thread William Lee Irwin III
On Wed, May 30, 2007 at 11:36:47PM -0700, William Lee Irwin III wrote:
>> Temporarily, yes. All this only works when averaged out.

On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So essentially when we calculate delta_mine component for each of those
> 1000 tasks, we will find that it has executed for 1 tick (4 ms say) but 
> its fair share was very very low.
>   fair_share = delta_exec * p->load_weight / total_weight
> If p->load_weight has been calculated after factoring in hierarchy (as
> you outlined in a previous mail), then p->load_weight of those 1000 tasks
> will be far less compared to the p->load_weight of one task belonging to
> other user, correct? Just to make sure I get all this correct:

You've got it all correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
>   User U1 has tasks T0 - T999
>   User U2 has task T1000
> assuming each task's weight is 1 and each user's weight is 1 then:
>   WT0 = (WU1 / WU1 + WU2) * (WT0 / WT0 + WT1 + ... + WT999)
>   = (1 / 1 + 1) * (1 / 1000)
>   = 1/2000
>   = 0.0005
>   WT1 ..WT999 will be same as WT0
> whereas, weight of T1000 will be:
>   WT1000  = (WU1 / WU1 + WU2) * (WT1000 / WT1000)
>   = (1 / 1 + 1) * (1/1)
>   = 0.5
> ?

Yes, these calculations are correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> So when T0 (or T1 ..T999) executes for 1 tick (4ms), their fair share would
> be:
>   T0's fair_share (delta_mine)
>   = 4 ms * 0.0005 / (0.0005 * 1000 + 0.5)
>   = 4 ms * 0.0005 / 1
>   = 0.002 ms (2000 ns)
> This would cause T0's ->wait_runtime to go negative sharply, causing it to be
> inserted back in rb-tree well ahead in future. One change I can forsee
> in CFS is with regard to limit_wait_runtime() ..We will have to change
> its default limit, atleast when group fairness thingy is enabled.
> Compared to this when T1000 executes for 1 tick, its fair share would be
> calculated as:
>   T1000's fair_share (delta_mine)
>   = 4 ms * 0.5 / (0.0005 * 1000 + 0.5)
>   = 4 ms * 0.5 / 1
>   = 2 ms (200 ns)
> Its ->wait_runtime will drop less significantly, which lets it be
> inserted in rb-tree much to the left of those 1000 tasks (and which indirectly
> lets it gain back its fair share during subsequent schedule cycles).

This analysis is again entirely correct.


On Thu, May 31, 2007 at 02:03:53PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..is that the theory?
> Ingo, do you have any comments on this approach?
> /me is tempted to try this all out.

Yes, this is the theory behind using task weights to flatten the task
group hierarchies. My prior post assumed all this and described a method
to make nice numbers behave as expected in the global context atop it.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 09:09:26PM -0700, William Lee Irwin III wrote:
>> It's not all that tricky. 

On Thu, May 31, 2007 at 11:18:28AM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..the fact that each task runs for a minimum of 1 tick seems to
> complicate the matters to me (when doing group fairness given a single
> level hierarchy). A user with 1000 (or more) tasks can be unduly
> advantaged compared to another user with just 1 (or fewer) task
> because of this?

Temporarily, yes. All this only works when averaged out.  The basic
idea is that you want a constant upper bound on the difference between
the CPU time a task receives and the CPU time it was intended to get.
This discretization is one of the larger sources of the "error" in the
CPU time granted. The constant upper bound usually only applies to the
largest difference for any task. When absolute values of differences
are summed across tasks the aggregate will be O(tasks) because there's
something almost like a constant per-task lower bound a la Heisenberg.
It would have to get more exact the more tasks there are on the system
for that to work, and something of the opposite actually holds.

It might be appropriate for the scheduler to dynamically adjust a
periodic timer's period or to set up one-shot timers at involuntary
preemption times in order to achieve more precise fairness in this
sort of situation. In the case of few preemption points such one-shot
code or low periodicity code would also save on taking interrupts that
would otherwise manifest as overhead.

In short, a user with many tasks can reap a temporary advantage
relative to users with fewer tasks because of this, but over time,
longer-running tasks will receive the CPU time intended to within
some constant upper bound, provided other things aren't broken.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 01:13:59PM -0700, William Lee Irwin III wrote:
>> The step beyond was to show how nice numbers can be done with all that
>> hierarchical task grouping so they have global effects instead of
>> effects limited to the scope of the narrowest grouping hierarchy
>> containing the task. I had actually assumed the weighting and
>> flattening bits were already in your plans from some other post you
>> made and was building upon that.

On Thu, May 31, 2007 at 08:56:57AM +0530, Srivatsa Vaddagiri wrote:
> I would definitely be willing to try out any experiments you think of,
> esp those that allow the hierarchy to be flattened. atm fair_key
> calculation (in the context of cfs) seem to be the biggest challenge to 
> surmount for this to work.

It's not all that tricky. The ->fair_key computations are already
parametrized on load weights. The "task weights" here are just what
Linux calls "load weight," so we're largely done once task weights
are calculated.

The tricky part (if any) is essentially what you've already got nailed
down, that is, creating and manipulating the accounting objects for the
task groups or whatever you're calling them.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-30 Thread William Lee Irwin III
On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
>> The smpnice affair is better phrased in terms of task weighting. It's
>> simple to honor nice in such an arrangement. First unravel the
>> grouping hierarchy, then weight by nice. This looks like
[...]
>> conveniently collapse to 1). Now that the hierarchy is flattened,
>> nice numbers can be factored in for t_1's final weight being
>> 0.7*0.36/(0.7*0.36+0.3*0.24+0.7*0.24+0.3*0.16) = 0.252/0.54 = 0.467..
>> and the others being 0.133.. (t_2), 0.311.. (t_3), and 0.0889.. (t_4).

On Wed, May 30, 2007 at 10:44:05PM +0530, Srivatsa Vaddagiri wrote:
> Hmm ..so do you think this weight decomposition can be used to flatten
> the tree all the way to a single level in case of cfs? That would mean we can 
> achieve group fairness with single level scheduling in cfs ..I am
> somewhat skeptical that we can achieve group fairness with a single
> level rb-tree (and w/o substantial changes to pick_next_task logic in cfs
> that is), but if it can be accomplished would definitely be a great win.

Yes, the hierarchy can be flattened completely and global task weights
computed and used to achieve group fairness. The changes aren't to
pick_next_task() but rather to the ->fair_key computations. In fact, I
went a step beyond that.


On Sat, May 26, 2007 at 08:41:12AM -0700, William Lee Irwin III wrote:
>> In such a manner nice numbers obey the principle of least surprise.

The step beyond was to show how nice numbers can be done with all that
hierarchical task grouping so they have global effects instead of
effects limited to the scope of the narrowest grouping hierarchy
containing the task. I had actually assumed the weighting and
flattening bits were already in your plans from some other post you
made and was building upon that.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers

2007-05-30 Thread William Lee Irwin III
On Wed, May 30, 2007 at 12:14:55AM -0700, Andrew Morton wrote:
> So how do we do this?
> Is there any sneaky way in which we can modify the kernel so that this new
> code gets exercised more?  Obviously, tossing init into some default
> system-wide container would be a start.  But I wonder if we can be
> sneakier - for example, create a new container on each setuid(), toss the
> task into that.  Or something along those lines?

How about a container for each thread group, pgrp, session, and user?


-- wli

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread William Lee Irwin III
On Mon, May 28, 2007 at 10:09:19PM +0530, Srivatsa Vaddagiri wrote:
> What do these task weights control? Timeslice primarily? If so, I am not
> sure how well it can co-exist with cfs then (unless you are planning to
> replace cfs with a equally good interactive/fair scheduler :)
> I would be very interested if this weight calculation can be used for
> smpnice based load balancing purposes too ..

Task weights represent shares of CPU bandwidth. If task i has weight w_i
then its share of CPU bandwidth is intended to be w_i/sum_i w_i.

"Load weight" seems to be used more in the scheduler source.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> Lag is the deviation of a task's allocated CPU time from the CPU time
>> it would be granted by the ideal fair scheduling algorithm (generalized
>> processor sharing; take the limit of RR with per-task timeslices
>> proportional to load weight as the scale factor approaches zero).

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> Over what time period does this operate?

Over a period of time while the task is runnable.


William Lee Irwin III wrote:
>> Negative lag reflects receipt of excess CPU time. A close-to-canonical
>> "fairness metric" is the maximum of the absolute values of the lags of
>> all the tasks on the system. The "signed minimax pseudonorm" is the
>> largest lag without taking absolute values; it's a term I devised ad
>> hoc to describe the proposed algorithm.

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> So what you're saying is that you think dynamic priority (or its 
> equivalent) should be used for load balancing instead of static priority?

It doesn't do much in other schemes, but when fairness is directly
measured by the dynamic priority, it is a priori more meaningful.
This is not to say the net effect of using it is so different.


William Lee Irwin III wrote:
>> I've presented a coherent
>> algorithm. It may be that there's no demonstrable problem to solve.
>> On the other hand, if there really is a question as to how to load
>> balance in the presence of tasks pinned to cpus, I just answered it.
> 
On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> Unless I missed something there's nothing in your suggestion that does 
> anything more about handling pinned tasks than is already done by the 
> load balancer.

There are two things, both of which are relatively subtle and
coincidentally happen to some extent. The first is the unpinned lag,
which behaves much differently from unpinned load weight even if it's
not so different in concept, but apparently achieves a similar result.
The second is the relativistic point of view, which happens somewhat by
coincidence anyway, but isn't formalized anywhere at all as a basis for
handling tasks pinned to cpus. The first difference is minor and the
second formalizing something that mostly or completely already happens.


William Lee Irwin III wrote:
>> There was a second issue raised to which I responded. I didn't stray
>> per se. I addressed a second topic in the post.

On Wed, May 30, 2007 at 10:09:28AM +1000, Peter Williams wrote:
> OK.
> To reiterate, I don't think that my suggestion is really necessary.  I 
> think that the current load balancing (stand fast a small bug that's 
> being investigated) will come up with a good distribution of tasks to 
> CPUs within the constraints imposed by any CPU affinity settings.

It's sort of like performance. If the numbers are already good enough,
my algorithm on that front need not be bothered with either.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-29 Thread William Lee Irwin III
William Lee Irwin III wrote:
>> Lag should be considered in lieu of load because lag

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> What's the definition of lag here?

Lag is the deviation of a task's allocated CPU time from the CPU time
it would be granted by the ideal fair scheduling algorithm (generalized
processor sharing; take the limit of RR with per-task timeslices
proportional to load weight as the scale factor approaches zero).
Negative lag reflects receipt of excess CPU time. A close-to-canonical
"fairness metric" is the maximum of the absolute values of the lags of
all the tasks on the system. The "signed minimax pseudonorm" is the
largest lag without taking absolute values; it's a term I devised ad
hoc to describe the proposed algorithm.


William Lee Irwin III wrote:
>> is what the
>> scheduler is trying to minimize;

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> This isn't always the case.  Some may prefer fairness to minimal lag. 
> Others may prefer particular tasks to receive preferential treatment.

This comment does not apply. Generalized processor sharing expresses
preferential treatment via weighting. Various other forms of
preferential treatment require more elaborate idealized models.


>> load is not directly relevant, but
>> appears to have some sort of relationship. Also, instead of pinned,
>> unpinned should be considered.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> If you have total and pinned you can get unpinned.  It's probably 
> cheaper to maintain data for pinned than unpinned as there's less of it 
> on normal systems.

Regardless of the underlying accounting, I've presented a coherent
algorithm. It may be that there's no demonstrable problem to solve.
On the other hand, if there really is a question as to how to load
balance in the presence of tasks pinned to cpus, I just answered it.


William Lee Irwin III wrote:
>> Using the signed minimax pseudonorm (i.e. the highest
>> signed lag, where positive is higher than all negative regardless of
>> magnitude) on unpinned lags yields a rather natural load balancing
>> algorithm consisting of migrating from highest to lowest signed lag,
>> with progressively longer periods for periodic balancing across
>> progressively higher levels of hierarchy in sched_domains etc. as usual.
>> Basically skip over pinned tasks as far as lag goes.
>> The trick with all that comes when tasks are pinned within a set of
>> cpus (especially crossing sched_domains) instead of to a single cpu.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> Yes, this makes the cost of maintaining the required data higher which 
> makes keeping pinned data more attractive than unpinned.
> BTW keeping data for sets of CPU affinities could cause problems as the 
> number of possible sets is quite large (being 2 to the power of the 
> number of CPUs).  So you need an algorithm based on pinned data for 
> single CPUs that knows the pinning isn't necessarily exclusive rather 
> than one based on sets of CPUs.  As I understand it (which may be 
> wrong), the mechanism you describe below takes that approach.

Yes, the mechanism I described takes that approach.


William Lee Irwin III wrote:
>> The smpnice affair is better phrased in terms of task weighting. It's
>> simple to honor nice in such an arrangement. First unravel the
>> grouping hierarchy, then weight by nice. This looks like
[...]
>> In such a manner nice numbers obey the principle of least surprise.

On Sun, May 27, 2007 at 11:29:51AM +1000, Peter Williams wrote:
> Is it just me or did you stray from the topic of handling cpu affinity 
> during load balancing to hierarchical load balancing?  I couldn't see 
> anything in the above explanation that would improve the handling of cpu 
> affinity.

There was a second issue raised to which I responded. I didn't stray
per se. I addressed a second topic in the post.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [ckrm-tech] [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-26 Thread William Lee Irwin III
Srivatsa Vaddagiri wrote:
>> Ingo/Peter, any thoughts here?  CFS and smpnice probably is "broken" 
>> with respect to such example as above albeit for nice-based tasks.

On Sat, May 26, 2007 at 10:17:42AM +1000, Peter Williams wrote:
> See above.  I think that faced with cpu affinity use by the system 
> administrator that smpnice will tend towards a task to cpu allocation 
> that is (close to) the best that can be achieved without violating the 
> cpu affinity assignments.  (It may take a little longer than normal but 
> it should get there eventually.)
> You have to assume that the system administrator knows what (s)he's 
> doing and is willing to accept the impact of their policy decision on 
> the overall system performance.
> Having said that, if it was deemed necessary you could probably increase 
> the speed at which the load balancer converged on a good result in the 
> face of cpu affinity by keeping a "pinned weighted load" value for each 
> run queue and using that to modify find_busiest_group() and 
> find_busiest_queue() to be a bit smarter.   But I'm not sure that it 
> would be worth the added complexity.

Just in case anyone was looking for algorithms...

Lag should be considered in lieu of load because lag is what the
scheduler is trying to minimize; load is not directly relevant, but
appears to have some sort of relationship. Also, instead of pinned,
unpinned should be considered. It's unpinned that load balancing can
actually migrate. Using the signed minimax pseudonorm (i.e. the highest
signed lag, where positive is higher than all negative regardless of
magnitude) on unpinned lags yields a rather natural load balancing
algorithm consisting of migrating from highest to lowest signed lag,
with progressively longer periods for periodic balancing across
progressively higher levels of hierarchy in sched_domains etc. as usual.
Basically skip over pinned tasks as far as lag goes.

The trick with all that comes when tasks are pinned within a set of
cpus (especially crossing sched_domains) instead of to a single cpu.
There one can just consider a cpu to enter a periodic load balance
cycle, and then consider pushing and pulling, perhaps what could be
called the "exchange lags" for the pair of cpus. That would be the
minimax lag pseudonorms for the tasks migratable to both cpus of the
pair. That makes the notion of moving things from highest to lowest
lag (where load is now considered) unambiguous apart from whether all
this converges, but not when to actually try to load balance vs. when
not to, or when it's urgent vs. when it should be done periodically.

To clarify that, an O(cpus**2) notion appears to be necessary, namely
the largest exchange lag differential between any pair of cpus. There
is also the open question of whether moving tasks between cpus with the
highest exchange lag differential will actually reduce it or whether it
runs the risk of increasing it by creating a larger exchange lag
differential between different pairs of cpus. A similar open question
is raised by localizing balancing decisions to sched_domains. What
remains clear is that any such movement reduces the worst-case lag in
the whole system. Because of that, the worst-case lag in the whole
system monotonically decreases as balancing decisions are made, and
that much is subject to an infinite descent argument. Unfortunately,
determining the largest exchange lag differential appears to be more
complex than merely finding the highest and lowest lags. Bipartite
forms of the problem also arise from sched_domains.

I doubt anyone's really paying any sort of attention, so I'll not
really bother working out much more in the way of details with respect
to load balancing. It may be that there are better ways to communicate
algorithmic notions than prose descriptions. However, it's doubtful I'll
produce anything in a timely enough fashion to attract or hold interest.

The smpnice affair is better phrased in terms of task weighting. It's
simple to honor nice in such an arrangement. First unravel the
grouping hierarchy, then weight by nice. This looks like

tasknicehier1   hier2   ... hierN
t_1 w_n1w_h11   w_h21   ... w_hN1
t_2 w_n2w_h12   w_h22   ... w_hN2
...

For the example of nice 0 vs. nice 10 as distinct users with 10%
steppings between nice levels, one would have

tasknice   hier1
t_1 1  1
t_2 0.3855 1

w_1, the weight of t_1, would be
(w_h11*w_n1/(w_h11*w_n1 + w_h12*w_n2))
= (1*1/(1 + 1*0.3855..))
= 0.7217..
w_2, the weight of t_2, would be
(w_h12*w_n2/(w_h11*w_n1 + w_h12*w_n2))
= (1*0.3855../(1 + 1*0.3855..))
= 0.27826..
This just so happens to work out to being the same as if t_1 and t_2
had their respective nice numbers without the scheduler grouping, which
is basically what everyone wants to happen.

It's more obvious how to extend it to more tasks than levels of
hierarchy. An example of 

[Devel] Re: [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-23 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> I'm not sure where it comes in as a design question. [...]

On Wed, May 23, 2007 at 09:55:28PM +0200, Ingo Molnar wrote:
> uhm, that's my whole point: it does _not_ come in as a design question 
> at all. You raised it, and i simply stated the fact that it doesnt 
> matter. Ok? :-) This is much ado about nothing i guess.

This is/was all overblown. There must've been some miscommunication.
I'll not try to clarify further.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-23 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> [...] As an interface it may be poor and worse yet poorly specified, 
>> [...]

On Wed, May 23, 2007 at 09:26:54PM +0200, Ingo Molnar wrote:
> that's the only thing that matters to fundamental design questions like 
> this.

I'm not sure where it comes in as a design question. Tuning whatever's
done for sched_yield() is just one of the drudgery tasks of rounding
out the implementation as I see it. It can be painful depending on
what's being changed, though it's not clear to me precisely how painful
tuning it will be for this change.

If it turns out to be a large enough issue, I can enforce reasonable
semantics for it with no impact on the core scheduling algorithm anyway.


* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> The content of my comment was that the patch does something to 
>> sched_yield() semantics, so it raises the question of what will happen 
>> in benchmarks and other performance affairs that are sensitive to 
>> sched_yield() semantics changes.

On Wed, May 23, 2007 at 09:26:54PM +0200, Ingo Molnar wrote:
> the correct aproach to the "sys_sched_yield() is an API that sucks" 
> problem is to simply _not use it_. User-space is figuring that out now, 
> fortunately.

We do need API's to displace it for that to really take off. I think
the yield_to() API's will help there. Sadly, I expect it to be a very
long time for the transition to really happen.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-23 Thread William Lee Irwin III
* William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> [...] sched_yield() semantics are yet another twist.

On Wed, May 23, 2007 at 08:40:35PM +0200, Ingo Molnar wrote:
> that's nonsense, sched_yield() semantics are totally uninteresting. It 
> is a fundamentally broken interface.

They're not totally uninteresting. People will complain when apps
break or perform poorly due to changes in its semantics. As an
interface it may be poor and worse yet poorly specified, but it has
non-negligible effects on performance issues that can't be ignored
and that will remain the case for the foreseeable future.

The content of my comment was that the patch does something to
sched_yield() semantics, so it raises the question of what will happen
in benchmarks and other performance affairs that are sensitive to
sched_yield() semantics changes.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-23 Thread William Lee Irwin III
On Wed, May 23, 2007 at 08:12:12PM +0200, Guillaume Chazarain wrote:
> o Doing an infinite loop as root seems to badly affect interactivity
> much more than with a normal user. Note that this is subjective, so
> maybe I'm smocking crack here.

CPU hogs as a distinct user will end up entitled to a whole user's
share of CPU bandwidth. That's just what it implements.


On Wed, May 23, 2007 at 08:12:12PM +0200, Guillaume Chazarain wrote:
> o Nice values are not reflected across users. From my test, if user1
> has a single busy loop at nice 19, and user2 a single busy loop at
> nice 0, both process will have a 50% CPU share, this looks wrong. Note
> that I have no idea how to solve this one.

This depends on how nice levels nest with task group weighting. If
they nest within task group weightings, the shares of CPU bandwidth
for single tasks having their own dedicated task groups will not
vary with nice number. It's harder to do the other way around with
hierarchical scheduling, but easy if the task groups merely determine
task weights.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] [PATCH 0/3] Add group fairness to CFS

2007-05-23 Thread William Lee Irwin III
On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> Here's an attempt to extend CFS (v13) to be fair at a group level, rather than
> just at task level. The patch is in a very premature state (passes
> simple tests, smp load balance not supported yet) at this point. I am sending 
> it out early to know if this is a good direction to proceed.

Well, SMP load balancing is what makes all this hard. sched_yield()
semantics are yet another twist.


On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> Salient points which needs discussion:
> 1. This patch reuses CFS core to achieve fairness at group level also.
>To make this possible, CFS core has been abstracted to deal with generic 
>schedulable "entities" (tasks, users etc).

The ability to handle deeper hierarchies would be useful for those
who want such semantics.


On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> 2. The per-cpu rb-tree has been split to be per-group per-cpu.
>schedule() now becomes two step on every cpu : pick a group first (from
>group rb-tree) and a task within that group next (from that group's task
>rb-tree)

That assumes per-user scheduling groups; other configurations would
make it one step for each level of hierarchy. It may be possible to
reduce those steps to only state transitions that change weightings
and incremental updates of task weightings. By and large, one needs
the groups to determine task weightings as opposed to hierarchically
scheduling, so there are alternative ways of going about this, ones
that would even make load balancing easier.


On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> 3. Grouping mechanism - I have used 'uid' as the basis of grouping for
>timebeing (since that grouping concept is already in mainline today).
>The patch can be adapted to a more generic process grouping mechanism
>(like http://lkml.org/lkml/2007/4/27/146) later.

I'd like to see how desirable the semantics achieved by reflecting
more of the process hierarchy structure in the scheduler groupings are.
Users, sessions, pgrps, and thread_groups would be the levels of
hierarchy there, where some handling of orphan pgrps is needed.


On Wed, May 23, 2007 at 10:18:59PM +0530, Srivatsa Vaddagiri wrote:
> Some results below, obtained on a 4way (with HT) Intel Xeon box. All 
> number are reflective of single CPU performance (tests were forced to 
> run on single cpu since load balance is not yet supported).
>uid "vatsa"   uid "guest"
>(make -s -j4 bzImage)(make -s -j20 bzImage)
> 
> 2.6.22-rc1  772.02 sec497.42 sec (real)
> 2.6.22-rc1+cfs-v13  780.62 sec478.35 sec (real)
> 2.6.22-rc1+cfs-v13+this patch 776.36 sec  776.68 sec (real)
> [ An exclusive cpuset containing only one CPU was created and the
> compilation jobs of both users were run simultaneously in this cpuset ]
> I also disabled CONFIG_FAIR_USER_SCHED and compared the results with
> cfs-v13:
>   uid "vatsa"
>   make -s -j4 bzImage
> 
> 2.6.22-rc1+cfs-v13395.57 sec (real)
> 2.6.22-rc1+cfs-v13+this_patch 388.54 sec (real)
> There is no regression I can see (rather some improvement, which I can't
> understand atm). I will run more tests later to check this regression aspect.
> Request your comments on the future direction to proceed!

Kernel compiles are markedly poor benchmarks. Try lat_ctx from lmbench,
VolanoMark, AIM7, OAST, SDET, and so on.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RSS controller v2 Test results (lmbench )

2007-05-21 Thread William Lee Irwin III
On Fri, 2007-05-18 at 09:37 +0530, Balbir Singh wrote:
>> oops! I wonder if AIM7 creates too many processes and exhausts all
>> memory. I've seen a case where during an upgrade of my tetex on my
>> laptop, the setup process failed and continued to fork processes
>> filling up 4GB of swap.

On Mon, May 21, 2007 at 09:53:34AM -0400, Lee Schermerhorn wrote:
> Jumping in late, I just want to note that in our investigations, when
> AIM7 gets into this situation [non-responsive system], it's because all
> cpus are in reclaim, spinning on an anon_vma spin lock.  AIM7 forks [10s
> of] thousands of children from a single parent, resultings in thousands
> of vmas on the anon_vma list.  shrink_inactive_list() must walk this
> list twice [page_referenced() and try_to_unmap()] under spin_lock for
> each anon page.  

I wonder how far out RCU'ing the anon_vma lock is.


On Mon, May 21, 2007 at 09:53:34AM -0400, Lee Schermerhorn wrote:
> [Aside:  Just last week, I encountered a similar situation on the
> i_mmap_lock for page cache pages running a 1200 user Oracle/OLTP run on
> a largish ia64 system.  Left the system spitting out "soft lockup"
> messages/stack dumps overnight.  Still spitting the next day, so I
> decided to reboot.]
> I have a patch that turns the anon_vma lock into a reader/writer lock
> that alleviates the problem somewhat, but with 10s of thousands of vmas
> on the lists, system still can't swap enough memory fast enough to
> recover.

Oh dear. Some algorithmic voodoo like virtually clustered scanning may
be in order in addition to anon_vma lock RCU'ing/etc.


On Mon, May 21, 2007 at 09:53:34AM -0400, Lee Schermerhorn wrote:
> We've run some AIM7 tests with Rik's "split lru list" patch, both with
> and without the anon_vma reader/writer lock patch.  We'll be posting
> results later this week.  Quick summary:  with Rik's patch, AIM
> performance tanks earlier, as the system starts swapping earlier.
> However, system remains responsive to shell input.  More into to follow.

I'm not sure where policy comes into this.


-- wli

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 6/7] Containers (V8): BeanCounters over generic process containers

2007-04-09 Thread William Lee Irwin III
On Fri, Apr 06, 2007 at 04:32:27PM -0700, [EMAIL PROTECTED] wrote:
> This patch implements the BeanCounter resource control abstraction
> over generic process containers. It contains the beancounter core
> code, plus the numfiles resource counter. It doesn't currently contain
> any of the memory tracking code or the code for switching beancounter
> context in interrupts.
> Currently all the beancounters resource counters are lumped into a
> single hierarchy; ideally it would be possible for each resource
> counter to be a separate container subsystem, allowing them to be

I'm very happy to see the BeanCounter work revived. The utility of
this work should extend far beyond feature work like containers and
into stability and reliability areas as well.


-- wli

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] kernel/pid.c pid allocation wierdness

2007-03-16 Thread William Lee Irwin III
William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> I'd not mind something better than a hashtable. The fib tree may make
>> more sense than anticipated. It's truly better to switch data structures
>> completely than fiddle with e.g. hashtable sizes. However, bear in mind
>> the degenerate space behavior of radix trees in sparse contexts.

On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> Grr.  s/patricia tree/fib tree/.  We use that in the networking for
> the forwarding information base and I got mis-remembered it.  Anyway
> the interesting thing with the binary version of radix tree is that
> path compression is well defined.  Path compression when you have
> multi-way branching is much more difficult.

Path compression isn't a big deal for multiway branching. I've usually
done it by aligning nodes and or'ing the number of levels to skip into
the lower bits of the pointer.


On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> Sure.  One of the reasons to be careful with switching data
> structures.  Currently the hash tables typically operate at 10:1
> unsued:used entries.  4096 entries and 100 processes.

That would be 40:1, which is "worse" in some senses. That's not
going to fly well when pid namespaces proliferate.


On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> The current work actually focuses on changing the code so we reduce
> the total number of hash table looks ups, but persistently storing
> struct pid pointers instead of storing a pid_t.  This has a lot
> of benefits when it comes to implementing a pid namespace but the
> secondary performance benefit is nice as well.

I can't quite make out what you mean by all that. struct pid is already
what's in the hashtable.


On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> Although my preliminary concern was the increase in typical list
> traversal length during lookup.  The current hash table typically does
> not have collisions so normally there is nothing to traverse.

Define "normally;" 1 threads and/or processes can be standard for
some affairs.


On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> Meanwhile an rcu tree would tend towards 2-3 levels which would double
> or triple our number of cache line misses on lookup and thus reduce
> performance in the normal case.

It depends on the sort of tree used. For B+ it depends on branching
factors. For hash tries (not previously discussed) with path compression
new levels would only be created when the maximum node size is exceeded.
For radix trees (I'm thinking hand-coded) with path compression it
depends on dispersion especially above the pseudo-levels used for the
lowest bits.


William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> RCU'ing radix trees is trendy but the current implementation needs a
>> spinlock where typically radix trees do not need them for RCU. I'm
>> talking this over with others interested in lockless radix tree
>> algorithms for reasons other than concurrency and/or parallelism.

On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> Sure.  I was thinking of the general class of data structures not the
> existing kernel one.  The multi-way branching and lack of comparisons
> looks interesting as a hash table replacement.  In a lot of cases
> storing hash values in a radix tree seems a very interesting
> alternative to a traditional hash table (and in that case the keys
> are randomly distributed and can easily be made dense).   For pids
> hashing the values first seems like a waste, and 

Comparisons vs. no comparisons is less interesting than cachelines
touched. B+ would either make RCU inconvenient or want comparisons by
dereferencing, which raises the number of cachelines touched.


William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> I'd say you already have enough evidence to motivate a change of data
>> structure.

On Fri, Mar 16, 2007 at 07:04:28AM -0600, Eric W. Biederman wrote:
> I have enough to know that a better data structure could improve
> things.  Getting something that is clearly better than what
> we have now is difficult.  Plus I have a lot of other patches to
> coordinate.  I don't think the current data structure is going to
> fall over before I get the pid namespace implemented so that is
> my current priority.

I'll look into the data structure code, then.


William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> Basically all that's needed for radix trees' space behavior to get
>> fixed up is proper path compression as opposed to the ->depth hack.
>> Done properly it also eliminates the need for a spinlock around the
>> whole radix tre

[Devel] Re: [RFC] kernel/pid.c pid allocation wierdness

2007-03-16 Thread William Lee Irwin III
William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> Radix trees' space behavior is extremely poor in sparsely-populated
>> index spaces. There is no way they would save space or even come close
>> to the current space footprint.

On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote:
> Possibly.  We aren't that sparsely populated when it comes to pids.
> Hash tables aren't good at saving space either, and when they are space
> efficient they are on the edge of long hash chains so they are on
> the edge of performance problems.  There is at least one variant of the
> fib tree that is as space efficient as any binary tree but works by
> looking at bits.  Not that I think that one makes much sense.

I'd not mind something better than a hashtable. The fib tree may make
more sense than anticipated. It's truly better to switch data structures
completely than fiddle with e.g. hashtable sizes. However, bear in mind
the degenerate space behavior of radix trees in sparse contexts.


William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> Lock contention here would be a severe functional regression (note
>> "functional," not "performance;" the lock contention surrounding these
>> affairs takes down systems vs. mere slowdown nonsense), so it would
>> necessarily depend on lockless radix tree code for correctness.

On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote:
> I don't know about the existing in kernel implementations but there is no
> reason we could not have an rcu protected radix tree.  At which point the
> challenges are about the same but we have indexes that would help us find
> the next free bit, which could reduce cpu time.

RCU'ing radix trees is trendy but the current implementation needs a
spinlock where typically radix trees do not need them for RCU. I'm
talking this over with others interested in lockless radix tree
algorithms for reasons other than concurrency and/or parallelism.


On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote:
> The current pid implementation does not scale to larger process counts
> particularly well.  The hash table size is fixed so we get a lot of hash
> collisions etc.  Several other things go wonky as well.  The one time
> I actually had 64K pids on a system (most of them zombies) it was not
> a very pleasant situation.

If you've got a microbenchmark that would be convenient for me to use
while addressing it, unless you'd prefer to take it on yourself.
Otherwise I can always write one myself.


On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote:
> It isn't common that we push the pid count up, and with the normal pid
> counts the current data structures seem very well suited to the
> problem.  I have been looking at the data structures though in case it
> ever changes because the current good behavior seems quite fragile.
> Not that I am advocating changing anything yet, but I'm thinking about
> it so when we do come to the point where it matters we can make a
> reasonable change.

I'd say you already have enough evidence to motivate a change of data
structure. Tree hashing (e.g. using balanced search trees in collision
chains) is generally good for eliminating straight-line issues of this
form but the available in-kernel tree structures don't do so well in
concurrent/parallel contexts and the utility of hashing becomes somewhat
questionable afterward given the stringent limits kernel environments
impose on pid spaces. I favor Peter Zijlstra's B+ trees once a few
relatively minor issues are addressed on account of good behavior in
sparse keyspaces, though cleanups (not stylistic) of radix trees' space
behavior may yet render them suitable, and may also be more popular to
pursue.

Basically all that's needed for radix trees' space behavior to get
fixed up is proper path compression as opposed to the ->depth hack.
Done properly it also eliminates the need for a spinlock around the
whole radix tree for RCU.


William Lee Irwin III <[EMAIL PROTECTED]> writes:
>> The comment block describing the hashtable locking is stale and should
>> have been updated in tandem with the RCU changes.

On Wed, Mar 14, 2007 at 10:54:07AM -0600, Eric W. Biederman wrote:
> Feel free to submit the patch.  I didn't make the RCU changes just took
> advantage of them.

I needed to note that because it and my description were in direct
conflict. I'd rather leave it for a kernel janitor or someone who needs
to get patches under their belt to sweep up as I've got enough laurels
to rest on and other areas where I can get large amounts of code in
easily enough, provided I get my act together on various fronts.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.osdl.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC] kernel/pid.c pid allocation wierdness

2007-03-16 Thread William Lee Irwin III
On Wed, Mar 14, 2007 at 08:12:35AM -0600, Eric W. Biederman wrote:
> If we do dig into this more we need to consider a radix_tree to hold
> the pid values.  That could replace both the pid map and the hash
> table, gracefully handle but large and small pid counts, might
> be a smidgin simpler, possibly be more space efficient, and it would
> more easily handle multiple pid namespaces.   The downside to using a
> radix tree is that is looks like it will have more cache misses for
> the normal pid map size, and it is yet another change that we would
> need to validate.

Radix trees' space behavior is extremely poor in sparsely-populated
index spaces. There is no way they would save space or even come close
to the current space footprint.

Lock contention here would be a severe functional regression (note
"functional," not "performance;" the lock contention surrounding these
affairs takes down systems vs. mere slowdown nonsense), so it would
necessarily depend on lockless radix tree code for correctness.

The comment block describing the hashtable locking is stale and should
have been updated in tandem with the RCU changes.


-- wli
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.osdl.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel