Hi, Bill and Petri

After yesterday's discussion and emails I understand two strict rules from
ARCH:

   1. The ODP APIs do not want to handle thread-to-core arrangement; they
   only provide availability information, right? (This may have to be
   repeated several times until I get it :)).
   2. The application therefore needs the odp-helper API to accomplish
   this; in other words, the odp-helper API takes responsibility for
   providing methods for ODP thread deployment and instantiation.

For ODP-427, can we narrow the goal down to "pin the main thread to core 0
via odp-helper to prevent validation test failures, while adding no
constraints to the ODP APIs at all"? Sounds good?

Then here is the proposal: in the libodphelper-linux library, add a
constructor function (declared with __attribute__((constructor))) that pins
the main thread to core 0. This fulfils the above goal and requires no
application code changes. (The only drawback is that we cannot pin to the
first available control-thread core, because the constructor runs before
odp_init_global(); no perfect world :)).
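A minimal sketch of what such a constructor could look like (the function
name odph_pin_main_thread is illustrative, not an existing helper):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Sketch of the proposed libodphelper-linux constructor: it runs before
 * main() -- and therefore before odp_init_global() -- and pins the main
 * thread to core 0. */
static void __attribute__((constructor)) odph_pin_main_thread(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set); /* core 0: always present on Linux */

	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		fprintf(stderr, "odph: could not pin main thread to core 0\n");
}
```

Because the constructor runs at load time, no application code changes are
needed; anything running after odp_init_global() can still re-pin via the
normal cpumask mechanisms.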

In the meanwhile we can open a new thread to further discuss the big topics:
1. the ODP thread concept and how threads are deployed and instantiated on a
platform.
2. further advanced topics around this (to be listed).

thanks and best regards, Yi


On 11 May 2016 at 08:08, Bill Fischofer <[email protected]> wrote:

> We didn't get around to discussing this during today's public call, but
> we'll try to cover this during tomorrow's ARCH call.
>
> As I noted earlier, the question of thread assignment to cores becomes
> complicated in virtual environment when it's not clear that a (virtual)
> core necessarily implies that there is any dedicated HW behind it. I think
> the simplest approach to take is simply to say that as long as the number
> of ODP threads is less than or equal to the number reported by
> odp_cpumask_all_available() and the number of control threads does not
> exceed odp_cpumask_control_default() and the number of worker threads does
> not exceed odp_cpumask_worker_default(), then the application can assume
> that each ODP thread will have its own CPU. If the thread count exceeds
> these numbers, then it is implementation-defined as to how ODP threads are
> multiplexed onto available CPUs in a fair manner.
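As a quick sanity check, the convention above reduces to a pure predicate.
The helper below is illustrative, not an ODP API; in real code the three
limits would come from odp_cpumask_all_available(),
odp_cpumask_default_control() and odp_cpumask_default_worker():

```c
/* Sketch of the proposed convention: an application may assume one CPU
 * per ODP thread only when its thread counts stay within the limits the
 * implementation reports. */
int threads_get_own_cpu(int n_control, int n_worker,
                        int avail_cpus, int max_control, int max_worker)
{
	return n_control + n_worker <= avail_cpus &&
	       n_control <= max_control &&
	       n_worker <= max_worker;
}
```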
>
> Applications that want best performance will adapt their thread usage to
> the number of CPUs available to it (subject to application-defined
> minimums, perhaps) to ensure that they don't have more threads than CPUs.
>
> If we have this convention then perhaps no additional APIs are needed to
> cover pinning/migration considerations?
>
> On Tue, May 10, 2016 at 8:04 AM, Yi He <[email protected]> wrote:
>
>> Hi, Petri
>>
>> While we can continue processor-related discussions in Bill's new
>> comprehensive email thread, for ODP-427 (how to guarantee the locality of
>> the odp_cpu_xxx() APIs) can we make a decision between two choices in
>> tomorrow's ARCH meeting?
>>
>> *Choice one: *add a constraint to the ODP thread concept: every ODP thread
>> will be pinned to one CPU core. In this case, only the main thread is
>> accidentally not pinned to a core :); it is an ODP_THREAD_CONTROL, but is
>> not instantiated through odph_linux_pthread_create().
>>
>> The solution can be: in odp_init_global() API, after
>> odp_cpumask_init_global(), pin the main thread to the 1st available core
>> for control threads.
>>
>> *Choice two: *to allow ODP thread migration between CPU cores, new APIs
>> are required to enable/disable CPU migration on the fly (as the patch
>> suggests).
>>
>> Let's talk tomorrow. Thanks and Best Regards, Yi
>>
>> On 10 May 2016 at 04:54, Bill Fischofer <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Mon, May 9, 2016 at 1:50 AM, Yi He <[email protected]> wrote:
>>>
>>>> Hi, Bill
>>>>
>>>> Thanks very much for your detailed explanation. I understand the
>>>> programming practice, like:
>>>>
>>>> /* First, the developer gets a chance to specify core availability for
>>>>  * the application instance.
>>>>  */
>>>> odp_init_global(... odp_init_t *param->worker_cpus & ->control_cpus ...)
>>>>
>>>> *So it is possible to run an application with different core
>>>> availability specs on different platforms, **and possible to run
>>>> multiple application instances on one platform in isolation.*
>>>>
>>>> *A: Making the above command-line parameters helps keep the application
>>>> binary portable: running it on platform A or B requires no re-compilation,
>>>> only a change of invocation parameters.*
>>>>
>>>
>>> The intent behind the ability to specify cpumasks at odp_init_global()
>>> time is to allow a launcher script that is configured by some provisioning
>>> agent (e.g., OpenDaylight) to communicate core assignments down to the ODP
>>> implementation in a platform-independent manner.  So applications will fall
>>> into two categories, those that have provisioned coremasks that simply get
> passed through and more "stand alone" applications that will use
>>> odp_cpumask_all_available() and odp_cpumask_default_worker/control() as
>>> noted earlier to size themselves dynamically to the available processing
>>> resources.  In both cases there is no need to recompile the application but
>>> rather to simply have it create an appropriate number of control/worker
>>> threads as determined either by external configuration or inquiry.
>>>
>>>
>>>>
>>>> /* The application developer fans out worker/control threads depending
>>>>  * on the needs and the actual availability.
>>>>  */
>>>> actually_available_cores =
>>>>     odp_cpumask_default_worker(&cores, needs_to_fanout_N_workers);
>>>>
>>>> iterator( actually_available_cores ) {
>>>>
>>>>     /* Fan out one worker thread instance */
>>>>     odph_linux_pthread_create(...upon one available core...);
>>>> }
>>>>
>>>> *B: Is odph_linux_pthread_create() a temporary helper API that will
>>>> converge into a platform-independent odp_thread_create(..one core spec...)
>>>> in the future? Or is it deliberately left as a platform-dependent helper
>>>> API?*
>>>>
>>>> Based on the above understanding, and back to the ODP-427 problem: it
>>>> seems only the main thread (the program entry) was accidentally not pinned
>>>> to one core :). The main thread is also an ODP_THREAD_CONTROL, but was not
>>>> instantiated through odph_linux_pthread_create().
>>>>
>>>
>>> ODP provides no APIs or helpers to control thread pinning. The only
>>> controls ODP provides is the ability to know the number of available cores,
>>> to partition them for use by worker and control threads, and the ability
>>> (via helpers) to create a number of threads of the application's choosing.
>>> The implementation is expected to schedule these threads to available cores
>>> in a fair manner, so if the number of application threads is less than or
>>> equal to the available number of cores then implementations SHOULD (but are
>>> not required to) pin each thread to its own core. Applications SHOULD NOT
>>> be designed to require or depend on any specific thread-to-core mapping,
>>> both for portability and because what constitutes a "core" in a virtual
>>> environment may or may not represent dedicated hardware.
>>>
>>>
>>>>
>>>> A solution can be: in the odp_init_global() API, after
>>>> odp_cpumask_init_global(), pin the main thread to the 1st available core
>>>> for control threads. This adds a new behavioural specification to the API,
>>>> but seems natural. Actually, Ivan's patch did most of this, except that the
>>>> core was fixed to 0. We can discuss it in today's meeting.
>>>>
>>>
>>> An application may consist of more than a single thread at the time it
>>> calls odp_init_global(), however it is RECOMMENDED that odp_init_global()
>>> be called only from the application's initial thread and before it creates
>>> any other threads to avoid the address space confusion that has been the
>>> subject of the past couple of ARCH calls and that we are looking to achieve
>>> consensus on. I'd like to move that discussion to a separate discussion
>>> thread from this one, if you don't mind.
>>>
>>>
>>>>
>>>> Thanks and Best Regards, Yi
>>>>
>>>> On 6 May 2016 at 22:23, Bill Fischofer <[email protected]>
>>>> wrote:
>>>>
>>>>> These are all good questions. ODP divides threads into worker threads
>>>>> and control threads. The distinction is that worker threads are supposed 
>>>>> to
>>>>> be performance sensitive and perform optimally with dedicated cores while
>>>>> control threads perform more "housekeeping" functions and would be less
>>>>> impacted by sharing cores.
>>>>>
>>>>> In the absence of explicit API calls, it is unspecified how an ODP
>>>>> implementation assigns threads to cores. The distinction between worker 
>>>>> and
>>>>> control thread is a hint to the underlying implementation that should be
>>>>> used in managing available processor resources.
>>>>>
>>>>> The APIs in cpumask.h enable applications to determine how many CPUs
>>>>> are available to it and how to divide them among worker and control 
>>>>> threads
>>>>> (odp_cpumask_default_worker() and odp_cpumask_default_control()).  Note
>>>>> that ODP does not provide APIs for setting specific threads to specific
>>>>> CPUs, so keep that in mind in the answers below.
>>>>>
>>>>>
>>>>> On Thu, May 5, 2016 at 7:59 AM, Yi He <[email protected]> wrote:
>>>>>
>>>>>> Hi, thanks Bill
>>>>>>
>>>>>> I understand the ODP thread concept more deeply now, and that in embedded
>>>>>> apps developers are involved in target-platform tuning/optimization.
>>>>>>
>>>>>> May I give a little example: say we have a data-plane app which
>>>>>> includes 3 ODP threads, and we would like to install and run it on 2
>>>>>> platforms.
>>>>>>
>>>>>>    - Platform A: 2 cores.
>>>>>>    - Platform B: 10 cores
>>>>>>
>>>>>> During initialization, the application can use
>>>>> odp_cpumask_all_available() to determine how many CPUs are available and
>>>>> can (optionally) use odp_cpumask_default_worker() and
>>>>> odp_cpumask_default_control() to divide them into CPUs that should be used
>>>>> for worker and control threads, respectively. For an application designed
>>>>> for scale-out, the number of available CPUs would typically be used to
>>>>> control how many worker threads the application creates. If the number of
>>>>> worker threads matches the number of worker CPUs then the ODP
>>>>> implementation would be expected to dedicate a worker core to each worker
>>>>> thread. If more threads are created than there are corresponding cores,
>>>>> then it is up to each implementation as to how it multiplexes them among
>>>>> the available cores in a fair manner.
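A sketch of that sizing logic, with illustrative application-defined bounds
(APP_MIN_WORKERS/APP_MAX_WORKERS are not ODP names, and in real code the
available count would come from odp_cpumask_default_worker()):

```c
/* Illustrative application-defined bounds, not ODP constants. */
#define APP_MIN_WORKERS 1
#define APP_MAX_WORKERS 8

/* Choose how many worker threads to create from the number of worker
 * CPUs the implementation reports. No recompilation per platform;
 * only the inquiry result changes. */
int choose_worker_count(int avail_worker_cpus)
{
	if (avail_worker_cpus < APP_MIN_WORKERS)
		return APP_MIN_WORKERS; /* oversubscribe; impl multiplexes fairly */
	if (avail_worker_cpus > APP_MAX_WORKERS)
		return APP_MAX_WORKERS; /* don't create more threads than needed */
	return avail_worker_cpus;   /* one worker per available worker CPU */
}
```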
>>>>>
>>>>>
>>>>>> Question, which one of the below assumptions is the current ODP
>>>>>> programming model?
>>>>>>
>>>>>> *1, *The application developer writes target-platform-specific code
>>>>>> to say that:
>>>>>>
>>>>>> On platform A run threads (0) on core (0), and threads (1,2) on core
>>>>>> (1).
>>>>>> On platform B run threads (0) on core (0), and threads (1) can scale
>>>>>> out and duplicate 8 instances on core (1~8), and thread (2) on core (9).
>>>>>>
>>>>>
>>>>> As noted, ODP does not provide APIs that permit specific threads to be
>>>>> assigned to specific cores. Instead it is up to each ODP implementation as
>>>>> to how it maps ODP threads to available CPUs, subject to the advisory
>>>>> information provided by the ODP thread type and the cpumask assignments 
>>>>> for
>>>>> control and worker threads. So in these examples suppose what the
>>>>> application has is two control threads and one or more workers.  For
>>>>> Platform A you might have core 0 defined for control threads and Core 1 
>>>>> for
>>>>> worker threads. In this case threads 0 and 1 would run on Core 0 while
>>>>> thread 2 ran on Core 1. For Platform B it's again up to the application 
>>>>> how
>>>>> it wants to divide the 10 CPUs between control and worker. It may want to
>>>>> have 2 control CPUs so that each control thread can have its own core,
>>>>> leaving 8 worker threads, or it might have the control threads share a
>>>>> single CPU and have 9 worker threads with their own cores.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Installing and running on a different platform requires the above
>>>>>> platform-specific code and recompilation for the target.
>>>>>>
>>>>>
>>>>> No. As noted, the model is the same. The only difference is how many
>>>>> control/worker threads the application chooses to create based on the
>>>>> information it gets during initialization by odp_cpumask_all_available().
>>>>>
>>>>>
>>>>>>
>>>>>> *2, *Application developer writes code to specify:
>>>>>>
>>>>>> Threads (0, 2) would not scale out
>>>>>> Threads (1) can scale out (up to a limit N?)
>>>>>> Platform A has 3 cores available (as command line parameter?)
>>>>>> Platform B has 10 cores available (as command line parameter?)
>>>>>>
>>>>>> Installing and running on a different platform may not require
>>>>>> re-compilation; ODP intelligently arranges the threads according to the
>>>>>> information provided.
>>>>>>
>>>>>
>>>>> Applications determine the minimum number of threads they require. For
>>>>> most applications they would tend to have a fixed number of control 
>>>>> threads
>>>>> (based on the application's functional design) and a variable number of
>>>>> worker threads (minimum 1) based on available processing resources. These
>>>>> application-defined minimums determine the minimum configuration the
>>>>> application might need for optimal performance, with scale out to larger
>>>>> configurations performed automatically.
>>>>>
>>>>>
>>>>>>
>>>>>> Last question: in some cases, like power-save mode shrinking the
>>>>>> available cores, would ODP intelligently re-arrange the ODP threads
>>>>>> dynamically at runtime?
>>>>>>
>>>>>
>>>>> The intent is that while control threads may have distinct roles and
>>>>> responsibilities (thus requiring that all always be eligible to be
>>>>> scheduled) worker threads are symmetric and interchangeable. So in this
>>>>> case if I have N worker threads to match to the N available worker CPUs 
>>>>> and
>>>>> power save mode wants to reduce that number to N-1, then the only effect 
>>>>> is
>>>>> that the worker CPU entering power save mode goes dormant along with the
>>>>> thread that is running on it. That thread isn't redistributed to some 
>>>>> other
>>>>> core because it's the same as the other worker threads. It is expected
>>>>> that cores would only enter power save state at odp_schedule() boundaries.
>>>>> So for example, if odp_schedule() determines that there is no work to
>>>>> dispatch to this thread then that might trigger the associated CPU to 
>>>>> enter
>>>>> low power mode. When later that core wakes up odp_schedule() would 
>>>>> continue
>>>>> and then return work to its reactivated thread.
>>>>>
>>>>> A slight wrinkle here is the concept of scheduler groups, which allows
>>>>> work classes to be dispatched to different groups of worker threads.  In
>>>>> this case the implementation might want to take scheduler group membership
>>>>> into consideration in determining which cores to idle for power savings.
>>>>> However, the ODP API itself is silent on this subject as it is
>>>>> implementation dependent how power save modes are managed.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks and Best Regards, Yi
>>>>>>
>>>>>
>>>>> Thank you for these questions. In answering them I realized we do not
>>>>> (yet) have this information covered in the ODP User Guide. I'll be using
>>>>> this information to help fill in that gap.
>>>>>
>>>>>
>>>>>>
>>>>>> On 5 May 2016 at 18:50, Bill Fischofer <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I've added this to the agenda for Monday's call, however I suggest
>>>>>>> we continue the dialog here as well as background.
>>>>>>>
>>>>>>> Regarding thread pinning, there's always been a tradeoff on that.
>>>>>>> On the one hand dedicating cores to threads is ideal for scale out in 
>>>>>>> many
>>>>>>> core systems, however ODP does not require many core environments to 
>>>>>>> work
>>>>>>> effectively, so ODP APIs enable but do not require or assume that cores 
>>>>>>> are
>>>>>>> dedicated to threads. That's really a question of application design and
>>>>>>> fit to the particular platform it's running on. In embedded environments
>>>>>>> you'll likely see this model more since the application knows which
>>>>>>> platform it's being targeted for. In VNF environments, by contrast, 
>>>>>>> you're
>>>>>>> more likely to see a blend where applications will take advantage of
>>>>>>> however many cores are available to it but will still run without 
>>>>>>> dedicated
>>>>>>> cores in environments with more modest resources.
>>>>>>>
>>>>>>> On Wed, May 4, 2016 at 9:45 PM, Yi He <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi, thanks Mike and Bill,
>>>>>>>>
>>>>>>>> From your clear summary, can we put it into several TO-DO
>>>>>>>> decisions (we can discuss them in the next ARCH call):
>>>>>>>>
>>>>>>>>    1. How to address the precise semantics of the existing
>>>>>>>>    timing APIs (odp_cpu_xxx) as they relate to processor locality.
>>>>>>>>
>>>>>>>>
>>>>>>>>    - *A:* guarantee by adding a constraint to the ODP thread
>>>>>>>>    concept: every ODP thread shall be deployed and pinned to one CPU core.
>>>>>>>>       - A sub-question: my understanding is that application
>>>>>>>>       programmers only need to specify available CPU sets for
>>>>>>>>       control/worker threads, and it is up to ODP to arrange the
>>>>>>>>       threads onto each CPU core while launching, right?
>>>>>>>>    - *B*: guarantee by adding new APIs to disable/enable CPU
>>>>>>>>    migration.
>>>>>>>>    - Then document this clearly in the user's guide or the API
>>>>>>>>    documentation.
>>>>>>>>
>>>>>>>>
>>>>>>>>    2. Understand the requirement to have both processor-local and
>>>>>>>>    system-wide timing APIs:
>>>>>>>>
>>>>>>>>
>>>>>>>>    - There are some APIs available in time.h (odp_time_local(),
>>>>>>>>    etc).
>>>>>>>>    - We can start a thread to work out the relationship, usage
>>>>>>>>    scenarios and constraints of the APIs in time.h and cpu.h.
>>>>>>>>
>>>>>>>> Best Regards, Yi
>>>>>>>>
>>>>>>>> On 4 May 2016 at 23:32, Bill Fischofer <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I think there are two fallouts from this discussion. First, there
>>>>>>>>> is the question of the precise semantics of the existing timing APIs 
>>>>>>>>> as
>>>>>>>>> they relate to processor locality. Applications such as profiling 
>>>>>>>>> tests, to
>>>>>>>>> the extent that they use APIs that have processor-local semantics, must 
>>>>>>>>> ensure
>>>>>>>>> that the thread(s) using these APIs are pinned for the duration of the
>>>>>>>>> measurement.
>>>>>>>>>
>>>>>>>>> The other point is the one that Petri brought up about having
>>>>>>>>> other APIs that provide timing information based on wall time or other
>>>>>>>>> metrics that are not processor-local.  While these may not have the 
>>>>>>>>> same
>>>>>>>>> performance characteristics, they would be independent of thread 
>>>>>>>>> migration
>>>>>>>>> considerations.
>>>>>>>>>
>>>>>>>>> Of course all this depends on exactly what one is trying to
>>>>>>>>> measure. Since thread migration is not free, allowing such activity 
>>>>>>>>> may or
>>>>>>>>> may not be relevant to what is being measured, so ODP probably wants 
>>>>>>>>> to
>>>>>>>>> have both processor-local and systemwide timing APIs.  We just need 
>>>>>>>>> to be
>>>>>>>>> sure they are specified precisely so that applications know how to 
>>>>>>>>> use them
>>>>>>>>> properly.
>>>>>>>>>
>>>>>>>>> On Wed, May 4, 2016 at 10:23 AM, Mike Holmes <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It sounded like the arch call was leaning towards documenting
>>>>>>>>>> that on odp-linux the application must ensure that odp_threads are
>>>>>>>>>> pinned to cores when launched.
>>>>>>>>>> This is a restriction that some platforms may not need to make,
>>>>>>>>>> vs the idea that a piece of ODP code can use these APIs to ensure the
>>>>>>>>>> behavior it needs without knowledge of or reliance on the wider system.
>>>>>>>>>>
>>>>>>>>>> On 4 May 2016 at 01:45, Yi He <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Establishing a performance profiling environment guarantees
>>>>>>>>>>> meaningful and consistent results from consecutive invocations
>>>>>>>>>>> of the odp_cpu_xxx() APIs. After profiling is done, restore the
>>>>>>>>>>> execution environment to its multi-core optimized state.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Yi He <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>>  include/odp/api/spec/cpu.h | 31 +++++++++++++++++++++++++++++++
>>>>>>>>>>>  1 file changed, 31 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/odp/api/spec/cpu.h
>>>>>>>>>>> b/include/odp/api/spec/cpu.h
>>>>>>>>>>> index 2789511..0bc9327 100644
>>>>>>>>>>> --- a/include/odp/api/spec/cpu.h
>>>>>>>>>>> +++ b/include/odp/api/spec/cpu.h
>>>>>>>>>>> @@ -27,6 +27,21 @@ extern "C" {
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  /**
>>>>>>>>>>> + * @typedef odp_profiler_t
>>>>>>>>>>> + * ODP performance profiler handle
>>>>>>>>>>> + */
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>> + * Setup a performance profiling environment
>>>>>>>>>>> + *
>>>>>>>>>>> + * A performance profiling environment guarantees meaningful
>>>>>>>>>>> and consistency of
>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs.
>>>>>>>>>>> + *
>>>>>>>>>>> + * @return performance profiler handle
>>>>>>>>>>> + */
>>>>>>>>>>> +odp_profiler_t odp_profiler_start(void);
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>>   * CPU identifier
>>>>>>>>>>>   *
>>>>>>>>>>>   * Determine CPU identifier on which the calling is running.
>>>>>>>>>>> CPU numbering is
>>>>>>>>>>> @@ -170,6 +185,22 @@ uint64_t odp_cpu_cycles_resolution(void);
>>>>>>>>>>>  void odp_cpu_pause(void);
>>>>>>>>>>>
>>>>>>>>>>>  /**
>>>>>>>>>>> + * Stop the performance profiling environment
>>>>>>>>>>> + *
>>>>>>>>>>> + * Stop performance profiling and restore the execution
>>>>>>>>>>> environment to its
>>>>>>>>>>> + * multi-core optimized state, won't preserve meaningful and
>>>>>>>>>>> consistency of
>>>>>>>>>>> + * consecutive invocations of the odp_cpu_xxx() APIs anymore.
>>>>>>>>>>> + *
>>>>>>>>>>> + * @param profiler  performance profiler handle
>>>>>>>>>>> + *
>>>>>>>>>>> + * @retval 0 on success
>>>>>>>>>>> + * @retval <0 on failure
>>>>>>>>>>> + *
>>>>>>>>>>> + * @see odp_profiler_start()
>>>>>>>>>>> + */
>>>>>>>>>>> +int odp_profiler_stop(odp_profiler_t profiler);
>>>>>>>>>>> +
>>>>>>>>>>> +/**
>>>>>>>>>>>   * @}
>>>>>>>>>>>   */
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> 1.9.1
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> lng-odp mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> https://lists.linaro.org/mailman/listinfo/lng-odp
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Mike Holmes
>>>>>>>>>> Technical Manager - Linaro Networking Group
>>>>>>>>>> Linaro.org <http://www.linaro.org/> │ Open source software for
>>>>>>>>>> ARM SoCs
>>>>>>>>>> "Work should be fun and collaborative, the rest follows"
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>