Hi, Petri

While we can continue the processor-related discussions in Bill's new comprehensive email thread, for ODP-427 (how to guarantee processor locality of the odp_cpu_xxx() APIs) can we make a decision between two choices in tomorrow's ARCH meeting?
*Choice one:* add a constraint to the ODP thread concept: every ODP thread will be pinned to one CPU core. In this case only the main thread was accidentally left unpinned :), it is an ODP_THREAD_CONTROL but is not instantiated through odph_linux_pthread_create(). The solution can be: in the odp_init_global() API, after odp_cpumask_init_global(), pin the main thread to the first available core for control threads.

*Choice two:* to allow ODP thread migration between CPU cores, new APIs are required to enable/disable CPU migration on the fly (as the patch suggests).

Let's talk tomorrow.

Thanks and Best Regards, Yi

On 10 May 2016 at 04:54, Bill Fischofer <[email protected]> wrote:

> On Mon, May 9, 2016 at 1:50 AM, Yi He <[email protected]> wrote:
>
>> Hi, Bill
>>
>> Thanks very much for your detailed explanation. I understand the programming practice to be:
>>
>> /* First the developer gets a chance to specify core availability for
>>  * the application instance.
>>  */
>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )
>>
>> *So it is possible to run an application with a different core availability spec on different platforms, and possible to run multiple application instances on one platform in isolation.*
>>
>> *A: Making the above command line parameters can help keep the application binary portable: running it on platform A or B requires no re-compilation, only a change of invocation parameters.*
>
> The intent behind the ability to specify cpumasks at odp_init_global() time is to allow a launcher script that is configured by some provisioning agent (e.g., OpenDaylight) to communicate core assignments down to the ODP implementation in a platform-independent manner.
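A launcher-provisioned coremask can be as simple as a hex string handed down on the command line. The following plain-POSIX sketch parses such a string into a cpu_set_t; the helper name is hypothetical and this is not an ODP API (ODP itself receives the masks through the odp_init_t parameter of odp_init_global()):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>

/* Hypothetical launcher-side helper (not an ODP API): turn a
 * provisioned hex coremask string such as "0x6" (cores 1 and 2)
 * into a cpu_set_t. In ODP the masks would instead be handed to
 * odp_init_global() via the odp_init_t parameter. */
static int parse_coremask(const char *str, cpu_set_t *set)
{
	char *end;
	unsigned long mask = strtoul(str, &end, 0);

	if (end == str || *end != '\0')
		return -1; /* not a valid number */

	CPU_ZERO(set);
	for (int cpu = 0; mask != 0; cpu++, mask >>= 1)
		if (mask & 1UL)
			CPU_SET(cpu, set);
	return 0;
}
```

On glibc the CPU_ZERO()/CPU_SET()/CPU_COUNT() macros require _GNU_SOURCE, and base 0 in strtoul() accepts both "0x6" and decimal forms.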
> So applications will fall into two categories: those that have provisioned coremasks that simply get passed through, and more "stand-alone" applications that will use odp_cpumask_all_available() and odp_cpumask_default_worker/control() as noted earlier to size themselves dynamically to the available processing resources. In both cases there is no need to recompile the application, but rather to simply have it create an appropriate number of control/worker threads as determined either by external configuration or inquiry.
>
>> /* The application developer fans out worker/control threads depending on
>>  * the needs and actual availability.
>>  */
>> actually_available_cores =
>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>>
>> iterator( actually_available_cores ) {
>>
>>     /* Fan out one worker thread instance */
>>     odph_linux_pthread_create(...upon one available core...);
>> }
>>
>> *B: Is odph_linux_pthread_create() a temporary helper API that will converge into a platform-independent odp_thread_create(..one core spec...) in the future? Or is it deliberately left as a platform-dependent helper API?*
>>
>> Based on the above understanding, and back to the ODP-427 problem: it seems only the main thread (the program entry point) was accidentally not pinned to one core :). The main thread is also an ODP_THREAD_CONTROL, but was not instantiated through odph_linux_pthread_create().
>
> ODP provides no APIs or helpers to control thread pinning. The only controls ODP provides are the ability to know the number of available cores, to partition them for use by worker and control threads, and the ability (via helpers) to create a number of threads of the application's choosing.
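The fan-out pseudo-code quoted above could be fleshed out with plain POSIX threads roughly as follows. This is a sketch only: pthread_create() plus pthread_setaffinity_np() stand in for odph_linux_pthread_create(), whose exact signature has varied between ODP releases, and the function and worker names are invented for illustration:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Create one worker per CPU in the calling process's affinity mask
 * and pin it there, up to max_workers. Returns the number of workers
 * launched, or -1 on error. */
static int fan_out_workers(void *(*worker_fn)(void *),
			   pthread_t thr[], int max_workers)
{
	cpu_set_t avail, one;
	int launched = 0;

	if (sched_getaffinity(0, sizeof(avail), &avail) != 0)
		return -1;

	for (int cpu = 0; cpu < CPU_SETSIZE && launched < max_workers; cpu++) {
		if (!CPU_ISSET(cpu, &avail))
			continue;

		if (pthread_create(&thr[launched], NULL, worker_fn, NULL))
			break;

		/* Pin the new worker to its own core, as an ODP
		 * implementation might do internally. */
		CPU_ZERO(&one);
		CPU_SET(cpu, &one);
		pthread_setaffinity_np(thr[launched], sizeof(one), &one);
		launched++;
	}
	return launched;
}

/* Trivial worker body used for illustration only. */
static void *idle_worker(void *arg)
{
	(void)arg;
	return NULL;
}
```

An application designed for scale-out would size max_workers from the worker cpumask rather than hard-coding it, which is exactly the dynamic-sizing model Bill describes.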
> The implementation is expected to schedule these threads to available cores in a fair manner, so if the number of application threads is less than or equal to the available number of cores then implementations SHOULD (but are not required to) pin each thread to its own core. Applications SHOULD NOT be designed to require or depend on any specific thread-to-core mapping, both for portability and because what constitutes a "core" in a virtual environment may or may not represent dedicated hardware.
>
>> A solution can be: in the odp_init_global() API, after odp_cpumask_init_global(), pin the main thread to the first available core for control threads. This adds a new behavioural specification to this API, but seems natural. Actually, Ivan's patch did most of this, except that the core was fixed to 0. We can discuss in today's meeting.
>
> An application may consist of more than a single thread at the time it calls odp_init_global(); however, it is RECOMMENDED that odp_init_global() be called only from the application's initial thread and before it creates any other threads, to avoid the address space confusion that has been the subject of the past couple of ARCH calls and that we are looking to achieve consensus on. I'd like to move that discussion to a separate thread from this one, if you don't mind.
>
>> Thanks and Best Regards, Yi
>>
>> On 6 May 2016 at 22:23, Bill Fischofer <[email protected]> wrote:
>>
>>> These are all good questions. ODP divides threads into worker threads and control threads. The distinction is that worker threads are supposed to be performance sensitive and perform optimally with dedicated cores, while control threads perform more "housekeeping" functions and would be less impacted by sharing cores.
>>>
>>> In the absence of explicit API calls, it is unspecified how an ODP implementation assigns threads to cores.
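The fix proposed earlier in this thread (pin the main thread inside odp_init_global(), after odp_cpumask_init_global(), to the first available control core rather than a hard-coded core 0) might be sketched with POSIX calls like this; the function name is hypothetical and this is not the actual odp-linux implementation:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical sketch: pin the calling (main) thread to the first
 * core its current affinity mask allows, instead of hard-coding
 * core 0. In ODP terms this would run inside odp_init_global()
 * after odp_cpumask_init_global() has set up the control cpumask.
 * Returns the chosen CPU, or -1 on error. */
static int pin_self_to_first_available(void)
{
	cpu_set_t avail, one;

	if (sched_getaffinity(0, sizeof(avail), &avail) != 0)
		return -1;

	for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (CPU_ISSET(cpu, &avail)) {
			CPU_ZERO(&one);
			CPU_SET(cpu, &one);
			/* On Linux, sched_setaffinity() with pid 0
			 * affects only the calling thread. */
			if (sched_setaffinity(0, sizeof(one), &one) != 0)
				return -1;
			return cpu;
		}
	}
	return -1;
}
```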
>>> The distinction between worker and control thread is a hint to the underlying implementation that should be used in managing available processor resources.
>>>
>>> The APIs in cpumask.h enable applications to determine how many CPUs are available to them and how to divide those CPUs among worker and control threads (odp_cpumask_default_worker() and odp_cpumask_default_control()). Note that ODP does not provide APIs for pinning specific threads to specific CPUs, so keep that in mind in the answers below.
>>>
>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <[email protected]> wrote:
>>>
>>>> Hi, thanks Bill
>>>>
>>>> I now understand the ODP thread concept more deeply, and that in embedded environments app developers are involved in target platform tuning/optimization.
>>>>
>>>> Can I give a little example? Say we have a data-plane app which includes 3 ODP threads, and we would like to install and run it on 2 platforms:
>>>>
>>>> - Platform A: 2 cores.
>>>> - Platform B: 10 cores.
>>>
>>> During initialization, the application can use odp_cpumask_all_available() to determine how many CPUs are available, and can (optionally) use odp_cpumask_default_worker() and odp_cpumask_default_control() to divide them into CPUs that should be used for worker and control threads, respectively. For an application designed for scale-out, the number of available CPUs would typically be used to control how many worker threads the application creates. If the number of worker threads matches the number of worker CPUs, then the ODP implementation would be expected to dedicate a worker core to each worker thread. If more threads are created than there are corresponding cores, then it is up to each implementation how it multiplexes them among the available cores in a fair manner.
>>>
>>>> Question: which of the below assumptions is the current ODP programming model?
>>>>
>>>> *1.* The application developer writes target-platform-specific code to state:
>>>>
>>>> On platform A run thread (0) on core (0), and threads (1,2) on core (1).
>>>> On platform B run thread (0) on core (0), let thread (1) scale out into 8 instances on cores (1~8), and run thread (2) on core (9).
>>>
>>> As noted, ODP does not provide APIs that permit specific threads to be assigned to specific cores. Instead, it is up to each ODP implementation how it maps ODP threads to available CPUs, subject to the advisory information provided by the ODP thread type and the cpumask assignments for control and worker threads. So in these examples, suppose what the application has is two control threads and one or more workers. For Platform A you might have core 0 defined for control threads and core 1 for worker threads; in this case threads 0 and 1 would run on core 0 while thread 2 ran on core 1. For Platform B it is again up to the application how it wants to divide the 10 CPUs between control and worker. It may want to have 2 control CPUs so that each control thread can have its own core, leaving 8 worker threads, or it might have the control threads share a single CPU and have 9 worker threads, each with its own core.
>>>
>>>> Installing and running on a different platform requires the above platform-specific code and recompilation for the target.
>>>
>>> No. As noted, the model is the same. The only difference is how many control/worker threads the application chooses to create, based on the information it gets during initialization from odp_cpumask_all_available().
>>>
>>>> *2.* The application developer writes code to specify:
>>>>
>>>> Threads (0, 2) would not scale out.
>>>> Thread (1) can scale out (up to a limit N?).
>>>> Platform A has 3 cores available (as a command line parameter?)
>>>> Platform B has 10 cores available (as a command line parameter?)
>>>>
>>>> Installing and running on a different platform may not require re-compilation; ODP intelligently arranges the threads according to the information provided.
>>>
>>> Applications determine the minimum number of threads they require. Most applications tend to have a fixed number of control threads (based on the application's functional design) and a variable number of worker threads (minimum 1) based on available processing resources. These application-defined minimums determine the minimum configuration the application might need for optimal performance, with scale-out to larger configurations performed automatically.
>>>
>>>> Last question: in some cases, such as power-save mode shrinking the set of available cores, would ODP intelligently re-arrange the ODP threads dynamically at runtime?
>>>
>>> The intent is that while control threads may have distinct roles and responsibilities (thus requiring that all always be eligible to be scheduled), worker threads are symmetric and interchangeable. So in this case, if I have N worker threads to match the N available worker CPUs and power-save mode wants to reduce that number to N-1, then the only effect is that the worker CPU entering power-save mode goes dormant along with the thread that is running on it. That thread isn't redistributed to some other core because it's the same as the other worker threads. It is expected that cores would only enter the power-save state at odp_schedule() boundaries. So, for example, if odp_schedule() determines that there is no work to dispatch to this thread, that might trigger the associated CPU to enter low-power mode. When that core later wakes up, odp_schedule() would continue and then return work to its reactivated thread.
>>>
>>> A slight wrinkle here is the concept of scheduler groups, which allows work classes to be dispatched to different groups of worker threads.
>>> In this case the implementation might want to take scheduler group membership into consideration when determining which cores to idle for power savings. However, the ODP API itself is silent on this subject, as it is implementation dependent how power-save modes are managed.
>>>
>>>> Thanks and Best Regards, Yi
>>>
>>> Thank you for these questions. In answering them I realized we do not (yet) have this information covered in the ODP User Guide. I'll be using this information to help fill in that gap.
>>>
>>>> On 5 May 2016 at 18:50, Bill Fischofer <[email protected]> wrote:
>>>>
>>>>> I've added this to the agenda for Monday's call; however, I suggest we continue the dialog here as well, as background.
>>>>>
>>>>> Regarding thread pinning, there has always been a tradeoff. On the one hand, dedicating cores to threads is ideal for scale-out in many-core systems; however, ODP does not require many-core environments to work effectively, so the ODP APIs enable but do not require or assume that cores are dedicated to threads. That's really a question of application design and fit to the particular platform it's running on. In embedded environments you'll likely see this model more, since the application knows which platform it's being targeted for. In VNF environments, by contrast, you're more likely to see a blend where applications will take advantage of however many cores are available to them, but will still run without dedicated cores in environments with more modest resources.
>>>>>
>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <[email protected]> wrote:
>>>>>
>>>>>> Hi, thanks Mike and Bill,
>>>>>>
>>>>>> From your clear summary, can we turn it into several TO-DO decisions? (We can discuss them in the next ARCH call.)
>>>>>>
>>>>>> 1. How to address the precise semantics of the existing timing APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>>
>>>>>>    - *A:* guarantee it by adding a constraint to the ODP thread concept: every ODP thread shall be deployed and pinned on one CPU core.
>>>>>>      - A sub-question: my understanding is that application programmers only need to specify the available CPU sets for control/worker threads, and it is up to ODP to arrange the threads onto each CPU core while launching, right?
>>>>>>    - *B:* guarantee it by adding new APIs to disable/enable CPU migration.
>>>>>>    - Then document it clearly in the user's guide or API documentation.
>>>>>>
>>>>>> 2. Understand the requirement to have both processor-local and system-wide timing APIs:
>>>>>>
>>>>>>    - There are some APIs available in time.h (odp_time_local(), etc).
>>>>>>    - We can have a thread to understand the relationships, usage scenarios and constraints of the APIs in time.h and cpu.h.
>>>>>>
>>>>>> Best Regards, Yi
>>>>>>
>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <[email protected]> wrote:
>>>>>>
>>>>>>> I think there are two fallouts from this discussion. First, there is the question of the precise semantics of the existing timing APIs as they relate to processor locality. Applications such as profiling tests, to the extent that they use APIs that have processor-local semantics, must ensure that the thread(s) using these APIs are pinned for the duration of the measurement.
>>>>>>>
>>>>>>> The other point is the one that Petri brought up about having other APIs that provide timing information based on wall time or other metrics that are not processor-local. While these may not have the same performance characteristics, they would be independent of thread migration considerations.
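Option B above (enable/disable CPU migration on the fly, which is what the odp_profiler_start()/odp_profiler_stop() patch in this thread exposes at the API level) could be realized on Linux roughly as below. This is a POSIX-only sketch under the assumption that pinning the caller to its current CPU is an acceptable way to "disable" migration; the type and function names are invented for illustration and are not the ODP implementation:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Plays the role of the odp_profiler_t handle: remembers the
 * affinity mask to restore when profiling ends. */
typedef struct {
	cpu_set_t saved_mask; /* affinity to restore on stop */
} profiler_state_t;

/* Pin the calling thread to the CPU it is currently on, so that
 * consecutive processor-local readings stay meaningful. */
static int profiler_start(profiler_state_t *st)
{
	cpu_set_t one;
	int cpu = sched_getcpu();

	if (cpu < 0 ||
	    sched_getaffinity(0, sizeof(st->saved_mask), &st->saved_mask))
		return -1;

	CPU_ZERO(&one);
	CPU_SET(cpu, &one);
	/* From here until profiler_stop() the thread cannot migrate. */
	return sched_setaffinity(0, sizeof(one), &one);
}

/* Re-enable migration by restoring the saved affinity mask. */
static int profiler_stop(profiler_state_t *st)
{
	return sched_setaffinity(0, sizeof(st->saved_mask), &st->saved_mask);
}
```

A measurement loop would then bracket its odp_cpu_cycles()-style readings with profiler_start()/profiler_stop(), restoring the multi-core optimized state afterwards, exactly as the patch's commit message describes.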
>>>>>>>
>>>>>>> Of course, all this depends on exactly what one is trying to measure. Since thread migration is not free, allowing such activity may or may not be relevant to what is being measured, so ODP probably wants to have both processor-local and system-wide timing APIs. We just need to be sure they are specified precisely, so that applications know how to use them properly.
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <[email protected]> wrote:
>>>>>>>
>>>>>>>> It sounded like the arch call was leaning towards documenting that on odp-linux the application must ensure that odp_threads are pinned to cores when launched. This is a restriction that some platforms may not need to make, versus the idea that a piece of ODP code can use these APIs to ensure the behavior it needs, without knowledge of or reliance on the wider system.
>>>>>>>>
>>>>>>>> On 4 May 2016 at 01:45, Yi He <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Establishing a performance profiling environment guarantees meaningful and consistent consecutive invocations of the odp_cpu_xxx() APIs. After profiling is done, the execution environment is restored to its multi-core optimized state.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Yi He <[email protected]>
>>>>>>>>> ---
>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h b/include/odp/api/spec/cpu.h
>>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>>> + * ODP performance profiler handle
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * Set up a performance profiling environment
>>>>>>>>> + *
>>>>>>>>> + * A performance profiling environment guarantees meaningful and consistent
>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.
>>>>>>>>> + *
>>>>>>>>> + * @return performance profiler handle
>>>>>>>>> + */
>>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * CPU identifier
>>>>>>>>>   *
>>>>>>>>>   * Determine the CPU identifier on which the calling thread is running. CPU numbering is
>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> + * Stop the performance profiling environment
>>>>>>>>> + *
>>>>>>>>> + * Stop performance profiling and restore the execution environment to its
>>>>>>>>> + * multi-core optimized state; meaningful and consistent consecutive
>>>>>>>>> + * invocations of the odp_cpu_xxx() APIs are no longer guaranteed.
>>>>>>>>> + *
>>>>>>>>> + * @param profiler performance profiler handle
>>>>>>>>> + *
>>>>>>>>> + * @retval 0 on success
>>>>>>>>> + * @retval <0 on failure
>>>>>>>>> + *
>>>>>>>>> + * @see odp_profiler_start()
>>>>>>>>> + */
>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>  * @}
>>>>>>>>>  */
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 1.9.1
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> lng-odp mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Holmes
>>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>>> Linaro.org <http://www.linaro.org/> *│* Open source software for ARM SoCs
>>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> lng-odp mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
_______________________________________________
lng-odp mailing list
[email protected]
https://lists.linaro.org/mailman/listinfo/lng-odp
