We didn't get around to discussing this during today's public call, but
we'll try to cover this during tomorrow's ARCH call.

As I noted earlier, the question of thread assignment to cores becomes
complicated in virtual environments, where it is not clear that a (virtual)
core necessarily implies any dedicated HW behind it. I think the simplest
approach is to say that as long as the number of ODP threads is less than
or equal to the number reported by odp_cpumask_all_available(), the number
of control threads does not exceed odp_cpumask_default_control(), and the
number of worker threads does not exceed odp_cpumask_default_worker(), then
the application can assume that each ODP thread will have its own CPU. If
the thread count exceeds these numbers, then it is implementation-defined
how ODP threads are multiplexed onto the available CPUs in a fair manner.

Applications that want the best performance will adapt their thread usage
to the number of CPUs available to them (subject to application-defined
minimums, perhaps) to ensure that they don't have more threads than CPUs.

If we adopt this convention, then perhaps no additional APIs are needed to
cover pinning/migration considerations?

On Tue, May 10, 2016 at 8:04 AM, Yi He <[email protected]> wrote:

> Hi, Petri
>
> While we can continue processor-related discussions in Bill's new
> comprehensive email thread, about ODP-427 of how to guarantee locality of
> odp_cpu_xxx() APIs, can we make a decision between two choices in
> tomorrow's ARCH meeting?
>
> *Choice one: *add a constraint to the ODP thread concept: every ODP thread
> will be pinned to one CPU core. In this case, only the main thread is
> accidentally left unpinned :); it is an ODP_THREAD_CONTROL, but is not
> instantiated through odph_linux_pthread_create().
>
> The solution can be: in odp_init_global() API, after
> odp_cpumask_init_global(), pin the main thread to the 1st available core
> for control threads.
>
> *Choice two: *to allow ODP thread migration between CPU cores, new APIs
> are required to enable/disable CPU migration on the fly (as the patch
> suggested).
>
> Let's talk tomorrow. Thanks and Best Regards, Yi
>
> On 10 May 2016 at 04:54, Bill Fischofer <[email protected]> wrote:
>
>>
>>
>> On Mon, May 9, 2016 at 1:50 AM, Yi He <[email protected]> wrote:
>>
>>> Hi, Bill
>>>
>>> Thanks very much for your detailed explanation. I understand the
>>> programming practice to be:
>>>
>>> /* First, the developer gets a chance to specify core availability
>>>  * for the application instance.
>>>  */
>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )
>>>
>>> *So it is possible to run an application with different core
>>> availability specs on different platforms, and to run multiple
>>> application instances on one platform in isolation.*
>>>
>>> *A: Making the above command-line parameters can help keep the application
>>> binary portable; running it on platform A or B then requires no
>>> re-compilation, only invocation parameter changes.*
>>>
>>
>> The intent behind the ability to specify cpumasks at odp_init_global()
>> time is to allow a launcher script that is configured by some provisioning
>> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP
>> implementation in a platform-independent manner.  So applications will fall
>> into two categories, those that have provisioned coremasks that simply get
>> passed through and more "stand alone" applications that will use
>> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as
>> noted earlier to size themselves dynamically to the available processing
>> resources.  In both cases there is no need to recompile the application but
>> rather to simply have it create an appropriate number of control/worker
>> threads as determined either by external configuration or inquiry.
>>
>>
>>>
>>> /* The application developer fans out worker/control threads depending
>>>  * on the needs and the actual availability.
>>>  */
>>> actually_available_cores =
>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>>>
>>> iterator( actually_available_cores ) {
>>>
>>>     /* Fan out one worker thread instance */
>>>     odph_linux_pthread_create(...upon one available core...);
>>> }
>>>
>>> *B: Is odph_linux_pthread_create() a temporary helper API that will
>>> converge into a platform-independent odp_thread_create(..one core spec...)
>>> in the future? Or is it deliberately left as a platform-dependent helper
>>> API?*
>>>
>>> Based on the above understanding and back to the ODP-427 problem: it seems
>>> only the main thread (program entrance) was accidentally not pinned to one
>>> core :). The main thread is also an ODP_THREAD_CONTROL, but was not
>>> instantiated through odph_linux_pthread_create().
>>>
>>
>> ODP provides no APIs or helpers to control thread pinning. The only
>> controls ODP provides are the ability to know the number of available cores,
>> to partition them for use by worker and control threads, and the ability
>> (via helpers) to create a number of threads of the application's choosing.
>> The implementation is expected to schedule these threads to available cores
>> in a fair manner, so if the number of application threads is less than or
>> equal to the available number of cores then implementations SHOULD (but are
>> not required to) pin each thread to its own core. Applications SHOULD NOT
>> be designed to require or depend on any specific thread-to-core mapping,
>> both for portability and because what constitutes a "core" in a virtual
>> environment may or may not represent dedicated hardware.
>>
>>
>>>
>>> A solution can be: in the odp_init_global() API, after
>>> odp_cpumask_init_global(), pin the main thread to the 1st available core
>>> for control threads. This adds a new behavioural specification to this API,
>>> but seems natural. Actually Ivan's patch did most of this, except that the
>>> core was fixed to 0. We can discuss it in today's meeting.
>>>
>>
>> An application may consist of more than a single thread at the time it
>> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()
>> be called only from the application's initial thread and before it creates
>> any other threads to avoid the address space confusion that has been the
>> subject of the past couple of ARCH calls and that we are looking to achieve
>> consensus on. I'd like to move that discussion to a separate thread from
>> this one, if you don't mind.
>>
>>
>>>
>>> Thanks and Best Regards, Yi
>>>
>>> On 6 May 2016 at 22:23, Bill Fischofer <[email protected]>
>>> wrote:
>>>
>>>> These are all good questions. ODP divides threads into worker threads
>>>> and control threads. The distinction is that worker threads are supposed to
>>>> be performance sensitive and perform optimally with dedicated cores while
>>>> control threads perform more "housekeeping" functions and would be less
>>>> impacted by sharing cores.
>>>>
>>>> In the absence of explicit API calls, it is unspecified how an ODP
>>>> implementation assigns threads to cores. The distinction between worker and
>>>> control thread is a hint to the underlying implementation that should be
>>>> used in managing available processor resources.
>>>>
>>>> The APIs in cpumask.h enable applications to determine how many CPUs
>>>> are available to it and how to divide them among worker and control threads
>>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note
>>>> that ODP does not provide APIs for setting specific threads to specific
>>>> CPUs, so keep that in mind in the answers below.
>>>>
>>>>
>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <[email protected]> wrote:
>>>>
>>>>> Hi, thanks Bill
>>>>>
>>>>> I now understand the ODP thread concept more deeply, and that in
>>>>> embedded apps developers are involved in target-platform
>>>>> tuning/optimization.
>>>>>
>>>>> Can I give a little example: say we have a data-plane app which
>>>>> includes 3 ODP threads, and we would like to install and run it on 2
>>>>> platforms:
>>>>>
>>>>>    - Platform A: 2 cores
>>>>>    - Platform B: 10 cores
>>>>>
>>>>> During initialization, the application can use
>>>> odp_cpumask_all_available() to determine how many CPUs are available and
>>>> can (optionally) use odp_cpumask_default_worker() and
>>>> odp_cpumask_default_control() to divide them into CPUs that should be used
>>>> for worker and control threads, respectively. For an application designed
>>>> for scale-out, the number of available CPUs would typically be used to
>>>> control how many worker threads the application creates. If the number of
>>>> worker threads matches the number of worker CPUs then the ODP
>>>> implementation would be expected to dedicate a worker core to each worker
>>>> thread. If more threads are created than there are corresponding cores,
>>>> then it is up to each implementation as to how it multiplexes them among
>>>> the available cores in a fair manner.
>>>>
>>>>
>>>>> Question, which one of the below assumptions is the current ODP
>>>>> programming model?
>>>>>
>>>>> *1, *Application developer writes target platform specific code to
>>>>> tell that:
>>>>>
>>>>> On platform A run threads (0) on core (0), and threads (1,2) on core
>>>>> (1).
>>>>> On platform B run threads (0) on core (0), and threads (1) can scale
>>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).
>>>>>
>>>>
>>>> As noted, ODP does not provide APIs that permit specific threads to be
>>>> assigned to specific cores. Instead it is up to each ODP implementation as
>>>> to how it maps ODP threads to available CPUs, subject to the advisory
>>>> information provided by the ODP thread type and the cpumask assignments for
>>>> control and worker threads. So in these examples suppose what the
>>>> application has is two control threads and one or more workers.  For
>>>> Platform A you might have core 0 defined for control threads and Core 1 for
>>>> worker threads. In this case threads 0 and 1 would run on Core 0 while
>>>> thread 2 ran on Core 1. For Platform B it's again up to the application how
>>>> it wants to divide the 10 CPUs between control and worker. It may want to
>>>> have 2 control CPUs so that each control thread can have its own core,
>>>> leaving 8 worker threads, or it might have the control threads share a
>>>> single CPU and have 9 worker threads with their own cores.
>>>>
>>>>
>>>>>
>>>>>
>>>>> Installing and running on a different platform requires the above
>>>>> platform-specific code and recompilation for the target.
>>>>>
>>>>
>>>> No. As noted, the model is the same. The only difference is how many
>>>> control/worker threads the application chooses to create based on the
>>>> information it gets during initialization by odp_cpumask_all_available().
>>>>
>>>>
>>>>>
>>>>> *2, *Application developer writes code to specify:
>>>>>
>>>>> Threads (0, 2) would not scale out
>>>>> Threads (1) can scale out (up to a limit N?)
>>>>> Platform A has 3 cores available (as command line parameter?)
>>>>> Platform B has 10 cores available (as command line parameter?)
>>>>>
>>>>> Installing and running on a different platform may not require
>>>>> re-compilation. ODP intelligently arranges the threads according to the
>>>>> information provided.
>>>>>
>>>>
>>>> Applications determine the minimum number of threads they require. For
>>>> most applications they would tend to have a fixed number of control threads
>>>> (based on the application's functional design) and a variable number of
>>>> worker threads (minimum 1) based on available processing resources. These
>>>> application-defined minimums determine the minimum configuration the
>>>> application might need for optimal performance, with scale out to larger
>>>> configurations performed automatically.
>>>>
>>>>
>>>>>
>>>>> Last question: in some cases, e.g. when power-save mode shrinks the
>>>>> number of available cores, would ODP intelligently re-arrange the ODP
>>>>> threads dynamically at runtime?
>>>>>
>>>>
>>>> The intent is that while control threads may have distinct roles and
>>>> responsibilities (thus requiring that all always be eligible to be
>>>> scheduled) worker threads are symmetric and interchangeable. So in this
>>>> case if I have N worker threads to match to the N available worker CPUs and
>>>> power save mode wants to reduce that number to N-1, then the only effect is
>>>> that the worker CPU entering power save mode goes dormant along with the
>>>> thread that is running on it. That thread isn't redistributed to some other
>>>> core because it's the same as the other worker threads. It is expected
>>>> that cores would only enter power save state at odp_schedule() boundaries.
>>>> So for example, if odp_schedule() determines that there is no work to
>>>> dispatch to this thread then that might trigger the associated CPU to enter
>>>> low power mode. When later that core wakes up odp_schedule() would continue
>>>> and then return work to its reactivated thread.
>>>>
>>>> A slight wrinkle here is the concept of scheduler groups, which allows
>>>> work classes to be dispatched to different groups of worker threads.  In
>>>> this case the implementation might want to take scheduler group membership
>>>> into consideration in determining which cores to idle for power savings.
>>>> However, the ODP API itself is silent on this subject as it is
>>>> implementation dependent how power save modes are managed.
>>>>
>>>>
>>>>>
>>>>> Thanks and Best Regards, Yi
>>>>>
>>>>
>>>> Thank you for these questions. In answering them I realized we do not
>>>> (yet) have this information covered in the ODP User Guide. I'll be using
>>>> this information to help fill in that gap.
>>>>
>>>>
>>>>>
>>>>> On 5 May 2016 at 18:50, Bill Fischofer <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I've added this to the agenda for Monday's call, however I suggest we
>>>>>> continue the dialog here as well as background.
>>>>>>
>>>>>> Regarding thread pinning, there's always been a tradeoff on that.  On
>>>>>> the one hand dedicating cores to threads is ideal for scale out in many
>>>>>> core systems, however ODP does not require many core environments to work
>>>>>> effectively, so ODP APIs enable but do not require or assume that cores 
>>>>>> are
>>>>>> dedicated to threads. That's really a question of application design and
>>>>>> fit to the particular platform it's running on. In embedded environments
>>>>>> you'll likely see this model more since the application knows which
>>>>>> platform it's being targeted for. In VNF environments, by contrast, 
>>>>>> you're
>>>>>> more likely to see a blend where applications will take advantage of
>>>>>> however many cores are available to it but will still run without 
>>>>>> dedicated
>>>>>> cores in environments with more modest resources.
>>>>>>
>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <[email protected]> wrote:
>>>>>>
>>>>>>> Hi, thanks Mike and Bill,
>>>>>>>
>>>>>>> From your clear summary, can we put it into several TO-DO
>>>>>>> decisions? (We can have a discussion in the next ARCH call.)
>>>>>>>
>>>>>>>    1. How to address the precise semantics of the existing
>>>>>>>    timing APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>>>
>>>>>>>
>>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:
>>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.
>>>>>>>       - A sub-question: my understanding is that application
>>>>>>>       programmers only need to specify available CPU sets for
>>>>>>>       control/worker threads, and it is up to ODP to arrange the
>>>>>>>       threads onto each CPU core while launching, right?
>>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU
>>>>>>>    migration.
>>>>>>>    - Then document clearly in user's guide or API document.
>>>>>>>
>>>>>>>
>>>>>>>    2. Understand the requirement to have both processor-local and
>>>>>>>    system-wide timing APIs:
>>>>>>>
>>>>>>>
>>>>>>>    - There are some APIs available in time.h (odp_time_local(),
>>>>>>>    etc).
>>>>>>>    - We can have a thread to understand the relationship, usage
>>>>>>>    scenarios and constraints of APIs in time.h and cpu.h.
>>>>>>>
>>>>>>> Best Regards, Yi
>>>>>>>
>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think there are two fallouts from this discussion. First, there
>>>>>>>> is the question of the precise semantics of the existing timing APIs
>>>>>>>> as they relate to processor locality. Applications such as profiling
>>>>>>>> tests, to the extent that they use APIs that have processor-local
>>>>>>>> semantics, must ensure that the thread(s) using these APIs are pinned
>>>>>>>> for the duration of the measurement.
>>>>>>>>
>>>>>>>> The other point is the one that Petri brought up about having other
>>>>>>>> APIs that provide timing information based on wall time or other 
>>>>>>>> metrics
>>>>>>>> that are not processor-local.  While these may not have the same
>>>>>>>> performance characteristics, they would be independent of thread 
>>>>>>>> migration
>>>>>>>> considerations.
>>>>>>>>
>>>>>>>> Of course all this depends on exactly what one is trying to
>>>>>>>> measure. Since thread migration is not free, allowing such activity 
>>>>>>>> may or
>>>>>>>> may not be relevant to what is being measured, so ODP probably wants to
>>>>>>>> have both processor-local and systemwide timing APIs.  We just need to 
>>>>>>>> be
>>>>>>>> sure they are specified precisely so that applications know how to use 
>>>>>>>> them
>>>>>>>> properly.
>>>>>>>>
>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> It sounded like the ARCH call was leaning towards documenting that
>>>>>>>>> on odp-linux the application must ensure that ODP threads are pinned
>>>>>>>>> to cores when launched.
>>>>>>>>> This is a restriction that some platforms may not need to make,
>>>>>>>>> versus the idea that a piece of ODP code can use these APIs to ensure
>>>>>>>>> the behavior it needs without knowledge of or reliance on the wider
>>>>>>>>> system.
>>>>>>>>>
>>>>>>>>> On 4 May 2016 at 01:45, Yi He <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Establishing a performance profiling environment guarantees
>>>>>>>>>> meaningful and consistent results from consecutive invocations of
>>>>>>>>>> the odp_cpu_xxx() APIs. After profiling is done, restore the
>>>>>>>>>> execution environment to its multi-core optimized state.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Yi He <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h
>>>>>>>>>> b/include/odp/api/spec/cpu.h
>>>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  /**
>>>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>>>> + * ODP performance profiler handle
>>>>>>>>>> + */
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * Setup a performance profiling environment
>>>>>>>>>> + *
>>>>>>>>>> + * A performance profiling environment guarantees meaningful and
>>>>>>>>>> + * consistent results from consecutive invocations of the
>>>>>>>>>> + * odp_cpu_xxx() APIs.
>>>>>>>>>> + *
>>>>>>>>>> + * @return performance profiler handle
>>>>>>>>>> + */
>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>>   * CPU identifier
>>>>>>>>>>   *
>>>>>>>>>>   * Determine the CPU identifier on which the calling thread is
>>>>>>>>>>   * running. CPU numbering is
>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>>>
>>>>>>>>>>  /**
>>>>>>>>>> + * Stop the performance profiling environment
>>>>>>>>>> + *
>>>>>>>>>> + * Stop performance profiling and restore the execution
>>>>>>>>>> + * environment to its multi-core optimized state; meaningful and
>>>>>>>>>> + * consistent results from consecutive invocations of the
>>>>>>>>>> + * odp_cpu_xxx() APIs are no longer guaranteed.
>>>>>>>>>> + *
>>>>>>>>>> + * @param profiler  performance profiler handle
>>>>>>>>>> + *
>>>>>>>>>> + * @retval 0 on success
>>>>>>>>>> + * @retval <0 on failure
>>>>>>>>>> + *
>>>>>>>>>> + * @see odp_profiler_start()
>>>>>>>>>> + */
>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>>   * @}
>>>>>>>>>>   */
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> 1.9.1
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> lng-odp mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Mike Holmes
>>>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for
>>>>>>>>> ARM SoCs
>>>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>