On 5/21/2014 9:13 AM, Sebastian Huber wrote:
> On 2014-05-21 16:00, Joel Sherrill wrote:
>> Hi
>>
>> We have an SMP behavioral decision to make and it isn't getting
>> enough discussion.
>>
>> With cluster scheduling, there are potentially multiple scheduler
>> instances associated with non-overlapping subsets of cores.
>>
>> With affinity, a thread can be restricted to execute on a subset
>> of the cores associated with a scheduler instance.
>>
>> There are operations to change the scheduler associated with
>> a thread and the affinity of a thread.
>>
>> The question is whether changing affinity should be able to
>> implicitly change the scheduler instance?
>>
>> I lean to no because affinity and scheduler changes should
>> be so rare in practical use that nothing should be implicit.
>
> The pthread_setaffinity_np() is the only way to set the scheduler via the
> non-portable POSIX API. So I would keep the ability to change the
> scheduler with it.
Changing the scheduler via the affinity API diverges from what that API does
on Linux and, I assume, *BSD. Linux has the concept of cpuset(7) and the
taskset(1) utility. Those are roughly comparable to scheduler instances in
that they provide a higher level restriction on what the affinity can do.
You can't change the taskset with affinity operations. The taskset is an
additional filter.

I write a lot below and give options, but the bottom line is that when you
implicitly change the scheduler via affinity, you end up with edge cases and
arbitrary decision making inside RTEMS, and you violate the rule of least
surprise. IMO we should not overload the affinity APIs to move between
scheduler instances. Ignoring any other reason I think this is bad, the user
should be 100% aware it is happening.

> How would you change the scheduler with the non-portable POSIX API?

Since we started with an evaluation of the Linux and *BSD APIs, I think we
should return to them. It is an error on Linux for pthread_setaffinity_np()
when "The affinity bit mask contains no processors that are currently
physically on the system and permitted to the thread according to any
restrictions that may be imposed by the 'cpuset' mechanism described in
cpuset(7)."

I read this as: it is an error to specify a set containing no processors
that are available to this thread, after filtering by what is physically
present and by the cpuset(7) restrictions. It is not an error to specify
processors that are physically present but outside the cpuset(7). But
according to cpuset(7):

   Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
   mechanism and the mbind(2) and set_mempolicy(2) memory-placement
   mechanisms in the kernel. Neither of these mechanisms let a process make
   use of a CPU or memory node that is not allowed by that process's cpuset.
   If changes to a process's cpuset placement conflict with these other
   mechanisms, then cpuset placement is enforced even if it means overriding
   these other mechanisms.

If you treat a cpuset(7) as similar to a scheduler instance, then it is an
error to specify an affinity with zero cores within your scheduler instance.
Including a processor in the affinity mask outside those owned by the
taskset (e.g. scheduler instance) is not an error based on the Linux man
pages. Those extra bits in the affinity mask would just be ignored.

On Linux, changing affinity explicitly can't move you outside that
cpuset(7). It is documented as a restriction on the affinity. This allows
you to have affinity for all processors and use the scheduler instance, or
taskset(1) on Linux, to balance. That seems reasonable and is the normal
use case.

==Options==

As best I can tell, we have a few options.

Option 1. You can't do it via the POSIX API and must use the Classic API.
Linux certainly has its own unique tool with taskset(1).

Option 2. Add other POSIX APIs that are _np. I don't really care if we add
POSIX _np methods or say that to change schedulers you have to use the
Classic API. Since this is so far beyond standardization, I lean toward
saying it is comparable to Linux taskset(1) and you must use the Classic API
scheduler methods. It is tightly tied to OS implementation and system
configuration. POSIX will never address it.

In both of those options, there currently is some behavior that does not
match Linux. Since affinity is not stored by schedulers without affinity
support, there is no way to maintain affinity information.
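To make the Linux semantics above concrete, here is a minimal sketch of a
request that names CPUs outside the caller's cpuset(7). The function name is
mine, just for illustration; only the glibc pthread_setaffinity_np() call
and cpu_set_t macros are real.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Ask for CPUs 0-3 even though the thread's cpuset(7) may only permit a
 * subset.  On Linux the call succeeds as long as at least one requested
 * CPU is present and permitted; the extra bits are simply filtered out.
 * It fails with EINVAL only when no requested CPU is usable. */
int request_wide_affinity( pthread_t thread )
{
  cpu_set_t mask;

  CPU_ZERO( &mask );
  CPU_SET( 0, &mask );
  CPU_SET( 1, &mask );
  CPU_SET( 2, &mask );
  CPU_SET( 3, &mask );

  return pthread_setaffinity_np( thread, sizeof( mask ), &mask );
}

That is the behavior I would expect from us as well: the cpuset/taskset (or
scheduler instance) filters the affinity, it is not changed by it.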
Since the per-scheduler Node information is created when a thread moves
between schedulers, affinity aware schedulers that maintain this information
only have a couple of options. The thread's affinity mask can be implicitly
set to all cores, or to all those owned by this instance, when it is moved
to a scheduler with affinity support. Any attempt by the application to set
an affinity that would be meaningful to another scheduler is ignored. This
does not follow the Linux/taskset behavior.

Consider this sequence: four core system, scheduler A on 0/1, scheduler B on
2/3. The application wants the thread to have affinity for 0 and 2. The
thread starts on scheduler A. When it changes to B, the process results in
the thread having affinity for both 2 and 3, so it could run on 3, violating
the affinity the application explicitly requested.

I think the user explicitly selecting the scheduler, with thread affinity
being part of the SMP node information, is the best option. I only see one
error condition to consider when explicitly setting the scheduler: the
thread's affinity mask does not include any of the new scheduler instance's
processors. This could easily be checked at the "scheduler independent"
level if we add a cpuset to Scheduler_SMP_Context to indicate which cores
the scheduler instance owns. Then we could easily error check this with the
cpuset.h boolean operations (see the sketch further below).

I am also now convinced that the SMP schedulers have the option of ignoring
affinity, but they should honor the data values. This was done by our
original code but changed upon request to not have the thread Affinity as
part of the basic Scheduler_SMP_Node. Hmmm... you could have an instance of
a single CPU scheduler in an SMP clustered configuration. This means the
affinity data really should be in Scheduler_Node when SMP is enabled.

Options 3-5 are implicit changes via set affinity, but they are not
desirable to me. I think we could make them work, but I don't like the
implications, the implicit actions, the change from Linux behavior, etc.

Option 3 (implicit). Allow setting affinity to no CPUs to remove a thread
from all schedulers. This would leave it in limbo, but it would be the
caller's obligation to follow up with a set affinity. I don't like this one
because it trips the Linux pthread_setaffinity_np() error above when there
are no cores specified. Option 3 is a no-go to me.

Option 4 (implicit). _Scheduler_Set_affinity should validate a 1->1
scheduler instance change and set the new affinity in the new instance.

Option 5 is an easy optimization of 4. Add a cpuset to Scheduler_SMP_Context
indicating which cores are associated with this scheduler instance.

Implicit scheduler changes break what I think is the most useful case of
clustered scheduling: affinity for cores in multiple schedulers, moving
threads dynamically to different scheduler instances to perform load
balancing.

>> Consider this scenario:
>>
>> Scheduler A: cores 0-1
>> Scheduler B: cores 2-3
>>
>> Thread 1 is associated with Scheduler B and with affinity 2-3
>> can run on either processor scheduled by B.
>>
>> Thread 1 changes affinity to 1-3. Should this change the scheduler,
>> be an error, or just have an affinity for a core in the system
>> that is not scheduled by this scheduler instance?
>
> This is currently an error:
>
> http://git.rtems.org/rtems/tree/testsuites/smptests/smpscheduler02/init.c#n141

OK. If I am reading this correctly, the destination affinity must be within
a single scheduler instance? If so, then that is not compatible with the
Linux use of affinity and task sets.
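Here is the rough sketch of the error check I mentioned above, assuming a
hypothetical owned-processor set added to the per-instance SMP context. The
structure and function names are mine, not existing RTEMS code, and I use
the glibc-style cpu_set_t macros purely for illustration.

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* Hypothetical: the set of processors owned by one scheduler instance,
 * stored in its SMP context.  The real Scheduler_SMP_Context does not
 * carry this field today. */
typedef struct {
  cpu_set_t Owned_processors;
} Scheduler_SMP_Context_sketch;

/* Scheduler independent check when a thread is explicitly moved to a new
 * scheduler instance: the thread's affinity mask must name at least one
 * processor owned by the destination instance. */
static bool scheduler_affinity_is_valid(
  const Scheduler_SMP_Context_sketch *destination,
  const cpu_set_t                    *thread_affinity
)
{
  cpu_set_t intersection;

  CPU_AND( &intersection, &destination->Owned_processors, thread_affinity );

  /* Zero overlap means the thread could never execute on the new
   * instance, so the explicit scheduler change should be rejected. */
  return CPU_COUNT( &intersection ) > 0;
}

Note that this mirrors the Linux model: bits outside the destination
instance are tolerated and simply ignored; the request is only rejected when
the intersection is empty.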
That said, I agree that requiring the destination affinity to be within a
single scheduler instance is the only safe use of implicit scheduler
changes. It is only a problem when the scheduler logic has moved the thread
to another scheduler instance.

>> If you look at the current code for _Scheduler_Set_affinity(),
>> it looks like the current behavior is none of the above and
>> appears to just be broken. Scheduler A's set affinity operation
>> is invoked and Scheduler B is not informed that it no longer
>> has control of Thread 1.
>
> It is informed in case _Scheduler_default_Set_affinity_body() is used.
> What is broken is the _Scheduler_priority_affinity_SMP_Set_affinity()
> function.

What is broken is the concept of implicitly changing schedulers via
affinity. By not discussing the options at the API level, you boxed yourself
into an implementation. Since user use cases were also not discussed before
you started coding, there was no feedback on what would be considered
desirable user visible behavior.

> Please have a look at the attached patch which I already sent to a similar
> thread.

Which patch?

--
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherr...@oarcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985

_______________________________________________
rtems-devel mailing list
rtems-devel@rtems.org
http://www.rtems.org/mailman/listinfo/rtems-devel