Hello Joel,

it's good that you finally have time to discuss the high-level API parts, more than one year after the project started. Up to now I had to guess what you actually wanted.

Thanks for the Linux API summary. I think this makes it easier to find the right choice for RTEMS.

On 2014-05-21 19:06, Joel Sherrill wrote:

On 5/21/2014 9:13 AM, Sebastian Huber wrote:
On 2014-05-21 16:00, Joel Sherrill wrote:
Hi

We have an SMP behavioral decision to make and it isn't getting
enough discussion.

With cluster scheduling, there are potentially multiple scheduler
instances associated with non-overlapping subsets of cores.

With affinity, a thread can be restricted to execute on a subset
of the cores associated with a scheduler instance.

There are operations to change the scheduler associated with
a thread and the affinity of a thread.

The question is whether changing affinity should be able to
implicitly change the scheduler instance?

I lean to no, because affinity and scheduler changes should be
so rare in practical use that nothing should be implicit.
pthread_setaffinity_np() is the only way to set the scheduler via the
non-portable POSIX API, so I would keep the ability to change the scheduler
with it.
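
For reference, setting affinity through this non-portable API looks roughly
like this (a minimal sketch; under the implicit interpretation, the target
processors would also select the scheduler instance):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Minimal sketch: request affinity for processors 2 and 3.  Under the
 * implicit interpretation this would also move the thread to the
 * scheduler instance owning those processors. */
static int request_affinity_2_3(pthread_t thread)
{
  cpu_set_t set;

  CPU_ZERO(&set);
  CPU_SET(2, &set);
  CPU_SET(3, &set);

  /* Returns 0 on success and an error number on failure. */
  return pthread_setaffinity_np(thread, sizeof(set), &set);
}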

Changing the scheduler via the affinity API violates what it does on Linux
and, I assume, *BSD. Linux has the concept of cpuset(7) and the taskset(1)
service. Those are roughly comparable to scheduler instances in that they
provide a higher-level restriction on what the affinity can do. You can't
change the taskset with affinity operations; the taskset is an additional
filter.

I write a lot below and give options, but the bottom line is that when you
implicitly change the scheduler via affinity, you end up with edge cases,
arbitrary decision making inside RTEMS, and a violation of the rule of least
surprise.

IMO we should not overload the affinity APIs to move scheduler instances.
Ignoring any other reason I think this is bad, the user should be 100%
aware it is happening.
How would you change the scheduler with the non-portable POSIX API?

Since we started with an evaluation of the Linux and *BSD APIs, I think we
should return to them.

It is an error on Linux for pthread_setaffinity_np() when:

"The affinity bit mask contains no processors that are currently
physically on the system and permitted to the thread according to any
restrictions that may be imposed by the "cpuset" mechanism described
in cpuset(7)."

I read this as an error for specifying a set that contains no
processors available to this thread, filtered by physical
availability and cpuset(7) restrictions. It is not an error to
specify processors that are physically present but outside the
cpuset(7).

But according to cpuset(7):

Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
mechanism and the mbind(2) and set_mempolicy(2) memory-placement
mechanisms in the kernel. Neither of these mechanisms let a process
make use of a CPU or memory node that is not allowed by that process's
cpuset. If changes to a process's cpuset placement conflict with these
other mechanisms, then cpuset placement is enforced even if it means
overriding these other mechanisms.

If you treat cpuset(7) as similar to a scheduler instance, then it is
an error to specify an affinity set containing zero processors of your
scheduler instance. Including a processor in the affinity mask outside
those owned by the taskset (e.g. scheduler instance) is not an error
based on the Linux man pages. Those extra bits in the affinity mask
would just be ignored.
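
To make that concrete, here is my reading of the filtering as a sketch
(using the glibc-style CPU_* macros for illustration, not actual kernel
code):

#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>

/* Sketch of my reading of the Linux semantics: the requested mask is
 * silently restricted to the cpuset, and only an empty result after
 * filtering is an error. */
static int filter_affinity(
  const cpu_set_t *requested,
  const cpu_set_t *cpuset,
  cpu_set_t       *effective
)
{
  CPU_AND(effective, requested, cpuset);

  if (CPU_COUNT(effective) == 0) {
    /* No permitted processor left after filtering. */
    return EINVAL;
  }

  return 0;
}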

On Linux, changing affinity explicitly can't move you outside that
cpuset(7). It is documented as a restriction on the affinity.

This allows you to have affinity for all processors and use the
scheduler instance or taskset(1) on Linux to balance.
Seems reasonable and the normal use case.

==Options==

As best I can tell, we have a few options.

Option 1. You can't do it via the POSIX API and must use the
Classic API. Linux certainly has its own unique tool with taskset(1).

taskset(1) is a simple wrapper around sched_setaffinity():

https://gitorious.org/util-linux-ng/util-linux-ng/source/de878776623b120fc1e96568f4cd69c349ec2677:schedutils/taskset.c

What corresponds more closely to our clustered scheduling concept is cpuset(7), which uses a pseudo-file-system interface.


Option 2. Add other POSIX APIs that are _np.

I don't really care if we add POSIX _np methods or say that
to change schedulers you have to use the Classic API. Since this
is so far beyond standardization, I lean to saying it is comparable
to Linux taskset(1) and you must use the Classic API scheduler
methods. It is tightly tied to the OS implementation and system
configuration. POSIX will never address it.
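
For illustration, the explicit Classic API path could look roughly like
this (a sketch only; the exact directive signatures have varied between
RTEMS versions):

#include <rtems.h>

/* Sketch: explicitly move a task to the scheduler instance named
 * 'SCHB'.  No affinity operation is involved, so the user is fully
 * aware that the scheduler changes. */
static rtems_status_code move_task_to_scheduler_b(rtems_id task_id)
{
  rtems_id          scheduler_b;
  rtems_status_code sc;

  /* Look up the scheduler instance by its configuration name. */
  sc = rtems_scheduler_ident(
    rtems_build_name('S', 'C', 'H', 'B'),
    &scheduler_b
  );
  if (sc != RTEMS_SUCCESSFUL) {
    return sc;
  }

  return rtems_task_set_scheduler(task_id, scheduler_b);
}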

In both of those options, there currently is some behavior that
does not match Linux. Since affinity is not stored by schedulers
without affinity support, there is no way to maintain affinity
information.

If we want to mimic this cpuset(7) behaviour, then we should store the affinity information for all schedulers.
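
Concretely, that could mean keeping the map in the scheduler-independent
per-thread data, along these lines (illustrative field names, not the
actual RTEMS structures):

#define _GNU_SOURCE
#include <sched.h>

/* Illustrative sketch: the affinity map lives in the scheduler
 * independent per-thread node data, so it survives moves between
 * scheduler instances, including instances without affinity support. */
typedef struct {
  /* ... existing scheduler independent per-thread data ... */
  cpu_set_t affinity;  /* preserved across scheduler changes */
} Scheduler_Node;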


Since the per-scheduler node information is created when a
thread moves schedulers, affinity-aware schedulers that maintain
this information have only a couple of options. The thread's
affinity mask can be implicitly set to all cores, or to all those owned
by this instance, when it is moved to a scheduler with affinity support.

Any attempt by the application to set an affinity that would be
meaningful to another scheduler is ignored. This does not
follow the Linux/taskset behavior.

Yes, to follow the Linux behaviour we should store the affinity map independently of the particular scheduler (e.g. in Scheduler_Node). The cpuset(7) man page states:

http://man7.org/linux/man-pages/man7/cpuset.7.html

"Every  process  in the system belongs to exactly one cpuset."

-> In RTEMS we can interpret this as "Every task in the system belongs to exactly one scheduler instance".

"Cpusets are integrated with the sched_setaffinity(2) scheduling affinity mechanism and the mbind(2) and set_mempolicy(2) memory-placement mechanisms in the kernel. Neither of these mechanisms let a process make use of a CPU or memory node that is not allowed by that process's cpuset. If changes to a process's cpuset placement conflict with these other mechanisms, then cpuset placement is enforced even if it means overriding these other mechanisms. The kernel accomplishes this overriding by silently restricting the CPUs and memory nodes requested by these other mechanisms to those allowed by the invoking process's cpuset. This can result in these other calls returning an error, if for example, such a call ends up requesting an empty set of CPUs or memory nodes, after that request is restricted to the invoking process's cpuset."

-> In RTEMS we should not change the scheduler with the set affinity operation. We just have to make sure that the affinity map includes at least one processor of the current scheduler instance. If we change the scheduler and the affinity map contains no processor of the target scheduler instance, then this should be an error.

Consider this sequence:

Four core system, scheduler A on 0/1, scheduler B on 2/3.
Application wants thread to have affinity for 0 and 2.
Thread starts on Scheduler A.

When it changes to B, this process results in the thread having
affinity for both 2 and 3, so it could run on 3, violating the
affinity the application explicitly requested.

I think the user explicitly selecting the scheduler, with thread
affinity being part of the SMP node information, is the best option.

Yes, I agree now.


I only see one error condition to consider when explicitly setting
the scheduler: the thread's affinity mask does not include any of the
new scheduler instance's processors. This could be checked at the
"scheduler independent" level if we add a cpuset to
Scheduler_SMP_Context to indicate which cores the scheduler instance
owns. Then we could easily error check this with the cpuset.h boolean
operations.
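
Roughly like this (a sketch with illustrative names, again using the
glibc-style CPU_* macros):

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* Illustrative sketch of the proposed context extension. */
typedef struct {
  /* ... existing SMP scheduler context data ... */
  cpu_set_t owned;  /* processors owned by this scheduler instance */
} Scheduler_SMP_Context;

/* An explicit scheduler change is valid only if the thread's affinity
 * mask contains at least one processor owned by the target instance. */
static bool scheduler_change_is_valid(
  const cpu_set_t             *thread_affinity,
  const Scheduler_SMP_Context *target
)
{
  cpu_set_t common;

  CPU_AND(&common, thread_affinity, &target->owned);
  return CPU_COUNT(&common) > 0;
}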

Ok.


I am also now convinced that the SMP schedulers have
the option of ignoring affinity, but they should honor the
data values. This was done by our original code but was changed
upon request to not have the thread affinity as part of
the basic Scheduler_SMP_Node.

Yes, sorry, I see it now differently.


Hmmm... you could have an instance of a single-CPU
scheduler in an SMP clustered configuration. This
means the affinity data really should be in
Scheduler_Node when SMP is enabled.

Options 3-5 are implicit changes via set affinity, but they
are not desirable to me. I think we could make them work,
but I don't like the implications: implicit actions, divergence from
Linux behavior, etc.

Option 3 (implicit). Allow setting affinity to no CPUs to remove a
thread from all schedulers. This would leave it in limbo
but it would be the caller's obligation to follow up with
a set affinity. I don't like this one because it trips the error
for Linux pthread_setaffinity_np() above when there are
no cores specified.

Option 3 is a no-go for me.

Option 4 (implicit): _Scheduler_Set_affinity should validate a 1->1
scheduler instance change and set the new affinity in
the new instance.

Option 5 is an easy optimization of 4. Add a cpuset to
Scheduler_SMP_Context indicating which cores are
associated with this scheduler instance.

Implicit scheduler changes break what I think is the
most useful case of clustered scheduling: affinity for
cores in multiple schedulers, with threads moved dynamically
to different scheduler instances to perform load balancing.

Consider this scenario:

Scheduler A: cores 0-1
Scheduler B: cores 2-3

Thread 1 is associated with Scheduler B and, with affinity 2-3,
can run on either processor scheduled by B.

Thread 1 changes its affinity to 1-3. Should this change the scheduler,
be an error, or just leave it with an affinity for a core in the system
that is not scheduled by this scheduler instance?
This is currently an error:

http://git.rtems.org/rtems/tree/testsuites/smptests/smpscheduler02/init.c#n141

OK. If I am reading this correctly, the destination affinity must be
within a single scheduler instance?

If so, then that is not compatible with the Linux use of affinity and
task sets. But I agree that it is the only safe use of implicit scheduler
changes.

It is only a problem when the scheduler logic has moved it to another
scheduler instance.

If you look at the current code for _Scheduler_Set_affinity(),
it looks like the current behavior is none of the above and
appears to just be broken. Scheduler A's set affinity operation
is invoked and Scheduler B is not informed that it no longer
has control of Thread 1.

It is informed if _Scheduler_default_Set_affinity_body() is used.  What is
broken is the _Scheduler_priority_affinity_SMP_Set_affinity() function.
What is broken is the concept of implicitly changing schedulers via
affinity.

I don't think it's broken, but there are alternatives. I think the alternative you proposed here is better, so we should remove the implicit scheduler changes from the set affinity operation.


By not discussing the options at the API level, you boxed yourself into an
implementation. Since user use cases were also not discussed before you
started coding, there was no feedback on what would be considered
desirable user-visible behavior.

If I had waited until now to start coding, we would be in deeper trouble. The support for affinity maps was on your list, and you didn't discuss anything in detail until now. It is trivial to change the implementation to reflect the new requirements. I have implemented and tested all the low-level parts, which you can use now.


Please have a look at the attached patch, which I already sent in a similar
thread.

Which patch?


It was attached to the last mail.

--
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.hu...@embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the German EHUG.