Hi, Petri

While we can continue the processor-related discussion in Bill's new
comprehensive email thread, for ODP-427 (how to guarantee processor
locality of the odp_cpu_xxx() APIs) can we make a decision between the two
choices below in tomorrow's ARCH meeting?

*Choice one: *add a constraint to the ODP thread concept: every ODP thread
is pinned to one CPU core. In this case only the main thread is
accidentally left unpinned :), since it is an ODP_THREAD_CONTROL but is not
instantiated through odph_linux_pthread_create().

The solution can be: in the odp_init_global() API, after
odp_cpumask_init_global(), pin the main thread to the first core available
for control threads.
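
A minimal sketch of that change, assuming linux-generic (here control_mask
stands for the global control cpumask that odp_cpumask_init_global() has
just prepared; error handling omitted):

    /* In odp_init_global(), after odp_cpumask_init_global():
     * pin the calling (main) thread to the first control core.
     * Requires _GNU_SOURCE, <pthread.h> and <sched.h>. */
    cpu_set_t cpuset;
    int cpu = odp_cpumask_first(&control_mask);

    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);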

*Choice two: *to allow ODP thread migration between CPU cores, new APIs
are required to disable/enable CPU migration on the fly (as the patch
below suggests).
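
For reference, usage of the proposed APIs (odp_profiler_start() and
odp_profiler_stop() come from the patch quoted at the end of this thread
and are not yet part of the spec) would look roughly like:

    odp_profiler_t profiler;
    uint64_t c1, c2, cycles;

    profiler = odp_profiler_start(); /* pin: disable migration */

    c1 = odp_cpu_cycles();
    /* ... code section being measured ... */
    c2 = odp_cpu_cycles();
    cycles = odp_cpu_cycles_diff(c2, c1); /* consistent: same core */

    odp_profiler_stop(profiler); /* restore: re-enable migration */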

Let's talk tomorrow. Thanks and Best Regards, Yi

On 10 May 2016 at 04:54, Bill Fischofer <[email protected]> wrote:

>
>
> On Mon, May 9, 2016 at 1:50 AM, Yi He <[email protected]> wrote:
>
>> Hi, Bill
>>
>> Thanks very much for your detailed explanation. I understand the
>> programming practice to be:
>>
>> /* First, the developer gets a chance to specify core availability for
>>  * the application instance.
>>  */
>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ... )
>>
>> *So it is possible to run an application with different core
>> availability specs on different platforms, **and possible to run
>> multiple application instances on one platform in isolation.*
>>
>> *A: Making the above command-line parameters helps keep the application
>> binary portable: running it on platform A or B then requires no
>> re-compilation, only different invocation parameters.*
>>
>
> The intent behind the ability to specify cpumasks at odp_init_global()
> time is to allow a launcher script that is configured by some provisioning
> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP
> implementation in a platform-independent manner.  So applications will fall
> into two categories, those that have provisioned coremasks that simply get
> passed through, and more "stand-alone" applications that will use
> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as
> noted earlier to size themselves dynamically to the available processing
> resources.  In both cases there is no need to recompile the application but
> rather to simply have it create an appropriate number of control/worker
> threads as determined either by external configuration or inquiry.
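>
> To make that concrete, a provisioned launch could pass the masks down at
> initialization time. This is only a sketch: the worker_cpus/control_cpus
> fields of odp_init_t follow the proposal discussed in this thread, and
> the mask strings would come from the launcher script:
>
>     odp_init_t param;
>     odp_cpumask_t workers, control;
>
>     memset(&param, 0, sizeof(param));
>     odp_cpumask_from_str(&workers, "0x3C"); /* cores 2-5 for workers */
>     odp_cpumask_from_str(&control, "0x02"); /* core 1 for control */
>     param.worker_cpus  = &workers;
>     param.control_cpus = &control;
>
>     if (odp_init_global(&param, NULL))
>             exit(EXIT_FAILURE); /* initialization failed */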
>
>
>>
>> /* The application developer fans out worker/control threads depending
>>  * on the needs and the actual availability.
>>  */
>> actually_available_cores =
>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>>
>> for each core in ( actually_available_cores ) {
>>
>>     /* Fan out one worker thread instance */
>>     odph_linux_pthread_create(...upon one available core...);
>> }
>>
>> *B: Is odph_linux_pthread_create() a temporary helper API that will
>> converge into a platform-independent odp_thread_create(..one core spec...)
>> in the future? Or is it deliberately left as a platform-dependent helper
>> API?*
>>
>> Based on the above understanding, and back to the ODP-427 problem: it
>> seems only the main thread (the program entry point) was accidentally not
>> pinned to one core :). The main thread is also an ODP_THREAD_CONTROL, but
>> it was not instantiated through odph_linux_pthread_create().
>>
>
> ODP provides no APIs or helpers to control thread pinning. The only
> controls ODP provides are the ability to know the number of available
> cores, to partition them for use by worker and control threads, and the
> ability (via helpers) to create a number of threads of the application's
> choosing. The implementation is expected to schedule these threads to
> available cores in a fair manner, so if the number of application threads
> is less than or equal to the available number of cores then
> implementations SHOULD (but are not required to) pin each thread to its
> own core. Applications SHOULD NOT be designed to require or depend on any
> specific thread-to-core mapping, both for portability and because what
> constitutes a "core" in a virtual environment may or may not represent
> dedicated hardware.
>
>
>>
>> A solution can be: in the odp_init_global() API, after
>> odp_cpumask_init_global(), pin the main thread to the first available
>> core for control threads. This adds a new behavioural specification to
>> the API, but seems natural. Actually, Ivan's patch did most of this,
>> except that the core was fixed to 0. We can discuss this in today's
>> meeting.
>>
>
> An application may consist of more than a single thread at the time it
> calls odp_init_global(); however, it is RECOMMENDED that odp_init_global()
> be called only from the application's initial thread and before it creates
> any other threads to avoid the address space confusion that has been the
> subject of the past couple of ARCH calls and that we are looking to achieve
> consensus on. I'd like to move that discussion to a separate thread from
> this one, if you don't mind.
>
>
>>
>> Thanks and Best Regards, Yi
>>
>> On 6 May 2016 at 22:23, Bill Fischofer <[email protected]> wrote:
>>
>>> These are all good questions. ODP divides threads into worker threads
>>> and control threads. The distinction is that worker threads are supposed to
>>> be performance sensitive and perform optimally with dedicated cores while
>>> control threads perform more "housekeeping" functions and would be less
>>> impacted by sharing cores.
>>>
>>> In the absence of explicit API calls, it is unspecified how an ODP
>>> implementation assigns threads to cores. The distinction between worker and
>>> control thread is a hint to the underlying implementation that should be
>>> used in managing available processor resources.
>>>
>>> The APIs in cpumask.h enable applications to determine how many CPUs are
>>> available to it and how to divide them among worker and control threads
>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note
>>> that ODP does not provide APIs for setting specific threads to specific
>>> CPUs, so keep that in mind in the answers below.
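>>>
>>> A minimal sketch of that inquiry (error handling omitted; a count
>>> argument of 0 is assumed to request as many CPUs as possible):
>>>
>>>     odp_cpumask_t all, workers, control;
>>>     int num_all, num_workers, num_control;
>>>
>>>     num_all     = odp_cpumask_all_available(&all);
>>>     num_control = odp_cpumask_default_control(&control, 1);
>>>     num_workers = odp_cpumask_default_worker(&workers, 0);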
>>>
>>>
>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <[email protected]> wrote:
>>>
>>>> Hi, thanks Bill
>>>>
>>>> I now understand the ODP thread concept more deeply, and that in
>>>> embedded apps developers are involved in target platform
>>>> tuning/optimization.
>>>>
>>>> Can I have a little example: say we have a data-plane app which
>>>> includes 3 ODP threads, and we would like to install and run it on 2
>>>> platforms.
>>>>
>>>>    - Platform A: 2 cores.
>>>>    - Platform B: 10 cores.
>>>>
>>> During initialization, the application can use
>>> odp_cpumask_all_available() to determine how many CPUs are available and
>>> can (optionally) use odp_cpumask_default_worker() and
>>> odp_cpumask_default_control() to divide them into CPUs that should be used
>>> for worker and control threads, respectively. For an application designed
>>> for scale-out, the number of available CPUs would typically be used to
>>> control how many worker threads the application creates. If the number of
>>> worker threads matches the number of worker CPUs then the ODP
>>> implementation would be expected to dedicate a worker core to each worker
>>> thread. If more threads are created than there are corresponding cores,
>>> then it is up to each implementation as to how it multiplexes them among
>>> the available cores in a fair manner.
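>>>
>>> In code, that sizing logic reduces to something like this sketch, where
>>> create_worker() is a hypothetical application wrapper around the
>>> thread-creation helpers:
>>>
>>>     odp_cpumask_t workers;
>>>     int i, num = odp_cpumask_default_worker(&workers, 0);
>>>
>>>     for (i = 0; i < num; i++)
>>>             create_worker(i); /* one worker per available worker CPU */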
>>>
>>>
>>>> Question: which of the assumptions below matches the current ODP
>>>> programming model?
>>>>
>>>> *1, *The application developer writes target-platform-specific code
>>>> to say:
>>>>
>>>> On platform A run thread (0) on core (0), and threads (1,2) on core
>>>> (1).
>>>> On platform B run thread (0) on core (0), let thread (1) scale out
>>>> into 8 instances on cores (1~8), and run thread (2) on core (9).
>>>>
>>>
>>> As noted, ODP does not provide APIs that permit specific threads to be
>>> assigned to specific cores. Instead it is up to each ODP implementation as
>>> to how it maps ODP threads to available CPUs, subject to the advisory
>>> information provided by the ODP thread type and the cpumask assignments for
>>> control and worker threads. So in these examples, suppose the
>>> application has two control threads and one or more workers. For
>>> Platform A you might have core 0 defined for control threads and Core 1 for
>>> worker threads. In this case threads 0 and 1 would run on Core 0 while
>>> thread 2 would run on Core 1. For Platform B it's again up to the application how
>>> it wants to divide the 10 CPUs between control and worker. It may want to
>>> have 2 control CPUs so that each control thread can have its own core,
>>> leaving 8 worker threads, or it might have the control threads share a
>>> single CPU and have 9 worker threads with their own cores.
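>>>
>>> For instance, the first Platform B split would be requested like this
>>> (sketch only, assuming the implementation keeps the two default masks
>>> disjoint):
>>>
>>>     odp_cpumask_t control, workers;
>>>
>>>     odp_cpumask_default_control(&control, 2); /* 2 control CPUs */
>>>     odp_cpumask_default_worker(&workers, 0); /* the remaining 8 */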
>>>
>>>
>>>>
>>>>
>>>> Installing and running on a different platform requires the above
>>>> platform-specific code and recompilation for the target.
>>>>
>>>
>>> No. As noted, the model is the same. The only difference is how many
>>> control/worker threads the application chooses to create based on the
>>> information it gets during initialization from odp_cpumask_all_available().
>>>
>>>
>>>>
>>>> *2, *The application developer writes code to specify:
>>>>
>>>> Threads (0, 2) would not scale out.
>>>> Thread (1) can scale out (up to a limit N?).
>>>> Platform A has 3 cores available (as a command-line parameter?).
>>>> Platform B has 10 cores available (as a command-line parameter?).
>>>>
>>>> Installing and running on a different platform may not require
>>>> re-compilation; ODP intelligently arranges the threads according to
>>>> the information provided.
>>>>
>>>
>>> Applications determine the minimum number of threads they require. Most
>>> applications would tend to have a fixed number of control threads
>>> (based on the application's functional design) and a variable number of
>>> worker threads (minimum 1) based on available processing resources. These
>>> application-defined minimums determine the minimum configuration the
>>> application might need for optimal performance, with scale out to larger
>>> configurations performed automatically.
>>>
>>>
>>>>
>>>> Last question: in some cases, like power-save mode shrinking the set of
>>>> available cores, would ODP intelligently re-arrange the ODP threads
>>>> dynamically at runtime?
>>>>
>>>
>>> The intent is that while control threads may have distinct roles and
>>> responsibilities (thus requiring that all always be eligible to be
>>> scheduled) worker threads are symmetric and interchangeable. So in this
>>> case if I have N worker threads to match to the N available worker CPUs and
>>> power save mode wants to reduce that number to N-1, then the only effect is
>>> that the worker CPU entering power save mode goes dormant along with the
>>> thread that is running on it. That thread isn't redistributed to some other
>>> core because it's the same as the other worker threads. It is expected
>>> that cores would only enter power-save state at odp_schedule() boundaries.
>>> So for example, if odp_schedule() determines that there is no work to
>>> dispatch to this thread then that might trigger the associated CPU to enter
>>> low power mode. When that core later wakes up, odp_schedule() would
>>> continue and then return work to its reactivated thread.
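>>>
>>> In other words, a typical worker loop gives the implementation a natural
>>> power-save boundary at every scheduler call (sketch; process() is a
>>> hypothetical application handler):
>>>
>>>     for (;;) {
>>>             odp_event_t ev = odp_schedule(NULL, ODP_SCHED_WAIT);
>>>
>>>             /* the CPU may have slept inside odp_schedule() */
>>>             process(ev);
>>>     }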
>>>
>>> A slight wrinkle here is the concept of scheduler groups, which allows
>>> work classes to be dispatched to different groups of worker threads.  In
>>> this case the implementation might want to take scheduler group membership
>>> into consideration in determining which cores to idle for power savings.
>>> However, the ODP API itself is silent on this subject as it is
>>> implementation dependent how power save modes are managed.
>>>
>>>
>>>>
>>>> Thanks and Best Regards, Yi
>>>>
>>>
>>> Thank you for these questions. In answering them I realized we do not
>>> (yet) have this information covered in the ODP User Guide. I'll be using
>>> this information to help fill in that gap.
>>>
>>>
>>>>
>>>> On 5 May 2016 at 18:50, Bill Fischofer <[email protected]>
>>>> wrote:
>>>>
>>>>> I've added this to the agenda for Monday's call, however I suggest we
>>>>> continue the dialog here as well as background.
>>>>>
>>>>> Regarding thread pinning, there's always been a tradeoff here. On the
>>>>> one hand, dedicating cores to threads is ideal for scale-out in
>>>>> many-core systems; however, ODP does not require many-core environments
>>>>> to work effectively, so ODP APIs enable but do not require or assume
>>>>> that cores are dedicated to threads. That's really a question of
>>>>> application design and fit to the particular platform it's running on.
>>>>> In embedded environments you'll likely see this model more, since the
>>>>> application knows which platform it's being targeted for. In VNF
>>>>> environments, by contrast, you're more likely to see a blend where
>>>>> applications will take advantage of however many cores are available to
>>>>> them but will still run without dedicated cores in environments with
>>>>> more modest resources.
>>>>>
>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <[email protected]> wrote:
>>>>>
>>>>>> Hi, thanks Mike and Bill,
>>>>>>
>>>>>> From your clear summary, can we turn it into several TO-DO decisions?
>>>>>> (We can discuss them in the next ARCH call.)
>>>>>>
>>>>>>    1. How to address the precise semantics of the existing timing
>>>>>>    APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>>
>>>>>>
>>>>>>    - *A:* guarantee by adding constraint to ODP thread concept:
>>>>>>    every ODP thread shall be deployed and pinned on one CPU core.
>>>>>>       - A sub-question: my understanding is that application
>>>>>>       programmers only need to specify the available CPU sets for
>>>>>>       control/worker threads, and it is up to ODP to arrange the
>>>>>>       threads onto each CPU core while launching, right?
>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU
>>>>>>    migration.
>>>>>>    - Then document this clearly in the user's guide or API
>>>>>>    documentation.
>>>>>>
>>>>>>
>>>>>>    1. Understand the requirement to have both processor-local and
>>>>>>    system-wide timing APIs:
>>>>>>
>>>>>>
>>>>>>    - There are some APIs available in time.h (odp_time_local(), etc).
>>>>>>    - We can have a thread to work out the relationship, usage
>>>>>>    scenarios and constraints of the APIs in time.h and cpu.h (see the
>>>>>>    sketch below).
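>>>>>>
>>>>>> (For reference, a wall-time measurement with the existing time.h APIs
>>>>>> is independent of thread migration; sketch:)
>>>>>>
>>>>>>     odp_time_t t1 = odp_time_local();
>>>>>>     /* ... section being measured ... */
>>>>>>     odp_time_t t2 = odp_time_local();
>>>>>>     uint64_t ns = odp_time_to_ns(odp_time_diff(t2, t1));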
>>>>>>
>>>>>> Best Regards, Yi
>>>>>>
>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I think there are two fallouts from this discussion. First, there is
>>>>>>> the question of the precise semantics of the existing timing APIs as
>>>>>>> they relate to processor locality. Applications such as profiling
>>>>>>> tests, to the extent that they use APIs that have processor-local
>>>>>>> semantics, must ensure that the thread(s) using these APIs are pinned
>>>>>>> for the duration of the measurement.
>>>>>>>
>>>>>>> The other point is the one that Petri brought up about having other
>>>>>>> APIs that provide timing information based on wall time or other
>>>>>>> metrics that are not processor-local. While these may not have the
>>>>>>> same performance characteristics, they would be independent of thread
>>>>>>> migration considerations.
>>>>>>>
>>>>>>> Of course all this depends on exactly what one is trying to measure.
>>>>>>> Since thread migration is not free, allowing such activity may or may
>>>>>>> not be relevant to what is being measured, so ODP probably wants to
>>>>>>> have both processor-local and system-wide timing APIs. We just need to
>>>>>>> be sure they are specified precisely so that applications know how to
>>>>>>> use them properly.
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> It sounded like the arch call was leaning towards documenting that
>>>>>>>> on odp-linux the application must ensure that odp_threads are pinned
>>>>>>>> to cores when launched.
>>>>>>>> This is a restriction that some platforms may not need to make, vs
>>>>>>>> the idea that a piece of ODP code can use these APIs to ensure the
>>>>>>>> behavior it needs without knowledge of, or reliance on, the wider
>>>>>>>> system.
>>>>>>>>
>>>>>>>> On 4 May 2016 at 01:45, Yi He <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Establishing a performance profiling environment guarantees
>>>>>>>>> meaningful and consistent results from consecutive invocations of
>>>>>>>>> the odp_cpu_xxx() APIs. After profiling is done, the environment is
>>>>>>>>> restored to its multi-core optimized state.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Yi He <[email protected]>
>>>>>>>>> ---
>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h
>>>>>>>>> b/include/odp/api/spec/cpu.h
>>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>>> + * ODP performance profiler handle
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * Setup a performance profiling environment
>>>>>>>>> + *
>>>>>>>>> + * A performance profiling environment guarantees meaningful and consistent
>>>>>>>>> + * results of consecutive invocations of the odp_cpu_xxx() APIs.
>>>>>>>>> + *
>>>>>>>>> + * @return performance profiler handle
>>>>>>>>> + */
>>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * CPU identifier
>>>>>>>>>   *
>>>>>>>>>   * Determine CPU identifier on which the calling is running. CPU
>>>>>>>>> numbering is
>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>>
>>>>>>>>>  /**
>>>>>>>>> + * Stop the performance profiling environment
>>>>>>>>> + *
>>>>>>>>> + * Stop performance profiling and restore the execution environment to its
>>>>>>>>> + * multi-core optimized state; meaningful and consistent results of
>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs are no longer preserved.
>>>>>>>>> + *
>>>>>>>>> + * @param profiler  performance profiler handle
>>>>>>>>> + *
>>>>>>>>> + * @retval 0 on success
>>>>>>>>> + * @retval <0 on failure
>>>>>>>>> + *
>>>>>>>>> + * @see odp_profiler_start()
>>>>>>>>> + */
>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>>   * @}
>>>>>>>>>   */
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 1.9.1
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> lng-odp mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mike Holmes
>>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>>> Linaro.org <http://www.linaro.org/> *│ *Open source software for
>>>>>>>> ARM SoCs
>>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> lng-odp mailing list
>>>>>>>> [email protected]
>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
lng-odp mailing list
[email protected]
https://lists.linaro.org/mailman/listinfo/lng-odp
