We didn't get around to discussing this during today's public call, but we'll try to cover this during tomorrow's ARCH call.
As I noted earlier, the question of thread assignment to cores becomes complicated in a virtual environment, where it is not clear that a (virtual) core necessarily implies any dedicated HW behind it. I think the simplest approach is simply to say that as long as the number of ODP threads is less than or equal to the number of CPUs reported by odp_cpumask_all_available(), the number of control threads does not exceed odp_cpumask_default_control(), and the number of worker threads does not exceed odp_cpumask_default_worker(), then the application can assume that each ODP thread will have its own CPU. If the thread count exceeds these numbers, then it is implementation-defined how ODP threads are multiplexed onto the available CPUs in a fair manner. Applications that want best performance will adapt their thread usage to the number of CPUs available to them (subject to application-defined minimums, perhaps) to ensure that they don't have more threads than CPUs. If we adopt this convention, then perhaps no additional APIs are needed to cover pinning/migration considerations?

On Tue, May 10, 2016 at 8:04 AM, Yi He <[email protected]> wrote:
> Hi, Petri
>
> While we can continue processor-related discussions in Bill's new comprehensive email thread, can we make a decision in tomorrow's ARCH meeting between the two choices for ODP-427 (how to guarantee the locality of the odp_cpu_xxx() APIs)?
>
> *Choice one:* add a constraint to the ODP thread concept: every ODP thread will be pinned to one CPU core. In this case, only the main thread was accidentally not pinned to one core :); it is an ODP_THREAD_CONTROL, but is not instantiated through odph_linux_pthread_create().
>
> The solution can be: in the odp_init_global() API, after odp_cpumask_init_global(), pin the main thread to the 1st available core for control threads.
>
> *Choice two:* to allow ODP thread migration between CPU cores, new APIs are required to enable/disable CPU migration on the fly (as the patch suggested).
>
> Let's talk tomorrow. Thanks and Best Regards, Yi
>
> On 10 May 2016 at 04:54, Bill Fischofer <[email protected]> wrote:
>>
>> On Mon, May 9, 2016 at 1:50 AM, Yi He <[email protected]> wrote:
>>>
>>> Hi, Bill
>>>
>>> Thanks very much for your detailed explanation. I understand the programming practice to be like:
>>>
>>> /* First, the developer gets a chance to specify core availability to the
>>>  * application instance.
>>>  */
>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )
>>>
>>> *So it is possible to run an application with different core availability specs on different platforms, and possible to run multiple application instances on one platform in isolation.*
>>>
>>> *A: Making the above command line parameters can help keep the application binary portable; running it on platform A or B then requires no recompilation, only a change of invocation parameters.*
>>
>> The intent behind the ability to specify cpumasks at odp_init_global() time is to allow a launcher script that is configured by some provisioning agent (e.g., OpenDaylight) to communicate core assignments down to the ODP implementation in a platform-independent manner. So applications will fall into two categories: those that have provisioned coremasks that simply get passed through, and more "stand-alone" applications that will use odp_cpumask_all_available() and odp_cpumask_default_worker/control() as noted earlier to size themselves dynamically to the available processing resources. In both cases there is no need to recompile the application, but rather to simply have it create an appropriate number of control/worker threads as determined either by external configuration or by inquiry.
>>
>>> /* The application developer fans out worker/control threads depending on
>>>  * the needs and actual availability.
>>>  */
>>> actually_available_cores =
>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>>>
>>> iterator( actually_available_cores ) {
>>>
>>>     /* Fan out one worker thread instance */
>>>     odph_linux_pthread_create(...upon one available core...);
>>> }
>>>
>>> *B: Is odph_linux_pthread_create() a temporary helper API that will converge into a platform-independent odp_thread_create(..one core spec...) in the future? Or is it deliberately left as a platform-dependent helper API?*
>>>
>>> Based on the above understanding, and back to the ODP-427 problem: it seems only the main thread (the program entrance) was accidentally not pinned to one core :). The main thread is also an ODP_THREAD_CONTROL, but was not instantiated through odph_linux_pthread_create().
>>
>> ODP provides no APIs or helpers to control thread pinning. The only controls ODP provides are the ability to know the number of available cores, to partition them for use by worker and control threads, and the ability (via helpers) to create a number of threads of the application's choosing. The implementation is expected to schedule these threads to available cores in a fair manner, so if the number of application threads is less than or equal to the available number of cores, then implementations SHOULD (but are not required to) pin each thread to its own core. Applications SHOULD NOT be designed to require or depend on any specific thread-to-core mapping, both for portability and because what constitutes a "core" in a virtual environment may or may not represent dedicated hardware.
>>
>>> A solution can be: in the odp_init_global() API, after odp_cpumask_init_global(), pin the main thread to the 1st available core for control threads. This adds a new behavioural specification to this API, but seems natural. Actually, Ivan's patch did most of this, except that the core was fixed to 0. We can discuss in today's meeting.
>>
>> An application may consist of more than a single thread at the time it calls odp_init_global(); however, it is RECOMMENDED that odp_init_global() be called only from the application's initial thread, and before it creates any other threads, to avoid the address space confusion that has been the subject of the past couple of ARCH calls and that we are looking to achieve consensus on. I'd like to move that discussion to a separate thread from this one, if you don't mind.
>>
>>> Thanks and Best Regards, Yi
>>>
>>> On 6 May 2016 at 22:23, Bill Fischofer <[email protected]> wrote:
>>>>
>>>> These are all good questions. ODP divides threads into worker threads and control threads. The distinction is that worker threads are supposed to be performance sensitive and perform optimally with dedicated cores, while control threads perform more "housekeeping" functions and would be less impacted by sharing cores.
>>>>
>>>> In the absence of explicit API calls, it is unspecified how an ODP implementation assigns threads to cores. The distinction between worker and control threads is a hint to the underlying implementation that should be used in managing available processor resources.
>>>>
>>>> The APIs in cpumask.h enable applications to determine how many CPUs are available to them and how to divide those CPUs among worker and control threads (odp_cpumask_default_worker() and odp_cpumask_default_control()). Note that ODP does not provide APIs for assigning specific threads to specific CPUs, so keep that in mind in the answers below.
>>>>
>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <[email protected]> wrote:
>>>>>
>>>>> Hi, thanks Bill
>>>>>
>>>>> I now understand the ODP thread concept more deeply, and that in embedded apps developers are involved in target platform tuning/optimization.
>>>>>
>>>>> Can I have a little example: say we have a data-plane app which includes 3 ODP threads.
>>>>> We would like to install and run it on 2 platforms:
>>>>>
>>>>> - Platform A: 2 cores.
>>>>> - Platform B: 10 cores.
>>>>
>>>> During initialization, the application can use odp_cpumask_all_available() to determine how many CPUs are available, and can (optionally) use odp_cpumask_default_worker() and odp_cpumask_default_control() to divide them into CPUs that should be used for worker and control threads, respectively. For an application designed for scale-out, the number of available CPUs would typically be used to control how many worker threads the application creates. If the number of worker threads matches the number of worker CPUs, then the ODP implementation would be expected to dedicate a worker core to each worker thread. If more threads are created than there are corresponding cores, then it is up to each implementation how it multiplexes them among the available cores in a fair manner.
>>>>
>>>>> Question: which one of the assumptions below is the current ODP programming model?
>>>>>
>>>>> *1,* The application developer writes target-platform-specific code to say:
>>>>>
>>>>> On platform A, run thread (0) on core (0) and threads (1,2) on core (1).
>>>>> On platform B, run thread (0) on core (0); thread (1) can scale out into 8 instances on cores (1~8); and thread (2) runs on core (9).
>>>>
>>>> As noted, ODP does not provide APIs that permit specific threads to be assigned to specific cores. Instead, it is up to each ODP implementation how it maps ODP threads to available CPUs, subject to the advisory information provided by the ODP thread type and the cpumask assignments for control and worker threads. So in these examples, suppose what the application has is two control threads and one or more workers. For Platform A you might have Core 0 defined for control threads and Core 1 for worker threads.
>>>> In this case threads 0 and 1 would run on Core 0 while thread 2 ran on Core 1. For Platform B it's again up to the application how it wants to divide the 10 CPUs between control and worker. It may want to have 2 control CPUs so that each control thread can have its own core, leaving 8 worker threads, or it might have the control threads share a single CPU and have 9 worker threads with their own cores.
>>>>
>>>>> Installing and running on a different platform requires the above platform-specific code and recompilation for the target.
>>>>
>>>> No. As noted, the model is the same. The only difference is how many control/worker threads the application chooses to create, based on the information it gets during initialization from odp_cpumask_all_available().
>>>>
>>>>> *2,* The application developer writes code to specify:
>>>>>
>>>>> Threads (0, 2) would not scale out.
>>>>> Thread (1) can scale out (up to a limit N?).
>>>>> Platform A has 3 cores available (as a command line parameter?).
>>>>> Platform B has 10 cores available (as a command line parameter?).
>>>>>
>>>>> Installing and running on a different platform may not require recompilation; ODP intelligently arranges the threads according to the information provided.
>>>>
>>>> Applications determine the minimum number of threads they require. Most applications would tend to have a fixed number of control threads (based on the application's functional design) and a variable number of worker threads (minimum 1) based on available processing resources. These application-defined minimums determine the minimum configuration the application might need for optimal performance, with scale-out to larger configurations performed automatically.
>>>>
>>>>> Last question: in some cases, such as power-save mode shrinking the set of available cores, would ODP intelligently re-arrange the ODP threads dynamically at runtime?
>>>>
>>>> The intent is that while control threads may have distinct roles and responsibilities (thus requiring that all always be eligible to be scheduled), worker threads are symmetric and interchangeable. So in this case, if I have N worker threads to match the N available worker CPUs and power-save mode wants to reduce that number to N-1, then the only effect is that the worker CPU entering power-save mode goes dormant along with the thread that is running on it. That thread isn't redistributed to some other core because it's the same as the other worker threads. It is expected that cores would only enter a power-save state at odp_schedule() boundaries. So, for example, if odp_schedule() determines that there is no work to dispatch to this thread, then that might trigger the associated CPU to enter low-power mode. When the core later wakes up, odp_schedule() would continue and then return work to its reactivated thread.
>>>>
>>>> A slight wrinkle here is the concept of scheduler groups, which allows work classes to be dispatched to different groups of worker threads. In this case the implementation might want to take scheduler group membership into consideration when determining which cores to idle for power savings. However, the ODP API itself is silent on this subject, as it is implementation-dependent how power-save modes are managed.
>>>>
>>>>> Thanks and Best Regards, Yi
>>>>
>>>> Thank you for these questions. In answering them I realized we do not (yet) have this information covered in the ODP User Guide. I'll be using this information to help fill in that gap.
>>>>
>>>>> On 5 May 2016 at 18:50, Bill Fischofer <[email protected]> wrote:
>>>>>>
>>>>>> I've added this to the agenda for Monday's call; however, I suggest we continue the dialog here as well, as background.
>>>>>>
>>>>>> Regarding thread pinning, there's always been a tradeoff on that. On the one hand, dedicating cores to threads is ideal for scale-out in many-core systems; however, ODP does not require many-core environments to work effectively, so ODP APIs enable, but do not require or assume, that cores are dedicated to threads. That's really a question of application design and fit to the particular platform it's running on. In embedded environments you'll likely see this model more, since the application knows which platform it's being targeted for. In VNF environments, by contrast, you're more likely to see a blend where applications will take advantage of however many cores are available but will still run without dedicated cores in environments with more modest resources.
>>>>>>
>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi, thanks Mike and Bill,
>>>>>>>
>>>>>>> From your clear summary, can we turn it into several TO-DO decisions? (We can discuss them in the next ARCH call.)
>>>>>>>
>>>>>>> 1. How to address the precise semantics of the existing timing APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>>>    - *A:* guarantee it by adding a constraint to the ODP thread concept: every ODP thread shall be deployed and pinned on one CPU core.
>>>>>>>      - A sub-question: my understanding is that application programmers only need to specify the available CPU sets for control/worker threads, and it is up to ODP to arrange the threads onto each CPU core while launching, right?
>>>>>>>    - *B:* guarantee it by adding new APIs to disable/enable CPU migration.
>>>>>>>    - Then document the choice clearly in the user's guide or API documentation.
>>>>>>>
>>>>>>> 2. Understand the requirement to have both processor-local and system-wide timing APIs:
>>>>>>>    - There are some APIs available in time.h (odp_time_local(), etc.).
>>>>>>>    - We can have a thread to understand the relationship, usage scenarios, and constraints of the APIs in time.h and cpu.h.
>>>>>>>
>>>>>>> Best Regards, Yi
>>>>>>>
>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I think there are two fallouts from this discussion. First, there is the question of the precise semantics of the existing timing APIs as they relate to processor locality. Applications such as profiling tests, to the extent that they use APIs that have processor-local semantics, must ensure that the thread(s) using these APIs are pinned for the duration of the measurement.
>>>>>>>>
>>>>>>>> The other point is the one that Petri brought up about having other APIs that provide timing information based on wall time or other metrics that are not processor-local. While these may not have the same performance characteristics, they would be independent of thread migration considerations.
>>>>>>>>
>>>>>>>> Of course, all this depends on exactly what one is trying to measure. Since thread migration is not free, allowing such activity may or may not be relevant to what is being measured, so ODP probably wants to have both processor-local and system-wide timing APIs. We just need to be sure they are specified precisely so that applications know how to use them properly.
>>>>>>>>
>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> It sounded like the arch call was leaning towards documenting that on odp-linux the application must ensure that odp_threads are pinned to cores when launched. This is a restriction that some platforms may not need to make, vs the idea that a piece of ODP code can use these APIs to ensure the behavior it needs without knowledge of, or reliance on, the wider system.
>>>>>>>>>
>>>>>>>>> On 4 May 2016 at 01:45, Yi He <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Establish a performance profiling environment that guarantees meaningful and consistent results from consecutive invocations of the odp_cpu_xxx() APIs. After profiling is done, restore the execution environment to its multi-core optimized state.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Yi He <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h
>>>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  /**
>>>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>>>> + * ODP performance profiler handle
>>>>>>>>>> + */
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * Set up a performance profiling environment
>>>>>>>>>> + *
>>>>>>>>>> + * A performance profiling environment guarantees meaningful and consistent
>>>>>>>>>> + * results from consecutive invocations of the odp_cpu_xxx() APIs.
>>>>>>>>>> + *
>>>>>>>>>> + * @return performance profiler handle
>>>>>>>>>> + */
>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>>   * CPU identifier
>>>>>>>>>>   *
>>>>>>>>>>   * Determine the CPU identifier on which the calling thread is running. CPU numbering is
>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>>>
>>>>>>>>>>  /**
>>>>>>>>>> + * Stop the performance profiling environment
>>>>>>>>>> + *
>>>>>>>>>> + * Stop performance profiling and restore the execution environment to its
>>>>>>>>>> + * multi-core optimized state. Meaningful and consistent results from
>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs are no longer guaranteed.
>>>>>>>>>> + *
>>>>>>>>>> + * @param profiler performance profiler handle
>>>>>>>>>> + *
>>>>>>>>>> + * @retval 0 on success
>>>>>>>>>> + * @retval <0 on failure
>>>>>>>>>> + *
>>>>>>>>>> + * @see odp_profiler_start()
>>>>>>>>>> + */
>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>>   * @}
>>>>>>>>>>   */
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 1.9.1
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> lng-odp mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mike Holmes
>>>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for ARM SoCs
>>>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> lng-odp mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
_______________________________________________ lng-odp mailing list [email protected] https://lists.linaro.org/mailman/listinfo/lng-odp
