Corey,

On Tue, Jul 1, 2008 at 6:58 PM, Corey J Ashford <[EMAIL PROTECTED]> wrote:
> You might want to add a separate section that details the thinking about
> why you don't want to use a single, multiplexed syscall. If you add this,
> it could go after you've detailed the session breakdown, and before you've
> described the current syscalls. I know this has been an area some of the
> LKML folks have picked at before.
>
That's a good point. I will add a paragraph about this. But I was thinking that they would probably not oppose a mix of the two: several syscalls, including one with multiplexing. For instance, I could envision a pfm_controls() which could be used to start/stop and attach/detach, and leave the other calls untouched.

> [EMAIL PROTECTED] wrote on 07/01/2008 09:41:36 AM:
>
>> Hello everyone,
>>
>> I intend to send the following description to LKML and a few LKML developers to try and explain the reasoning behind the current syscall interface for perfmon2.
>>
>> I know there have been a lot of doubts and misunderstandings as to why we need so many syscalls and how they could be extended. I tried to address those concerns here.
>>
>> Please feel free to comment or add to it.
>>
>> Thanks.
>>
>> -----------------------------------------------------------------------
>>
>> 1) monitoring session breakdown
>>
>> A monitoring session can be decomposed into a sequence of fundamental actions, which are as follows:
>> - create the session
>> - program registers
>> - attach to target thread or CPU
>> - start monitoring
>> - stop monitoring
>> - read results
>> - detach from thread or CPU
>> - terminate session
>>
>> The order is not necessarily as shown. For instance, the programming may happen after the session has been attached. Obviously, the start/stop operations may be repeated before results are read, and results can be read multiple times.
>>
>> In the next sections, we examine each action separately.
>>
>> 2) session creation
>>
>> Perfmon2 supports two types of sessions: per-thread and per-CPU (so-called system-wide).
>>
>> During the creation of the session, certain attributes are set; they remain fixed until the session is terminated. For instance, the per-CPU attribute cannot be changed.
>>
>> During creation, the kernel state to support the session is allocated and initialized. No PMU hardware is actually accessed. Permissions to create a session may be checked. Resource limits are also validated and memory consumption is accounted for.
>>
>> The software state of the PMU is initialized, i.e., all configuration registers are set to a quiescent value. Data registers are initialized to zero whenever possible.
>>
>> Upon return, the kernel provides a unique identifier which is to be used for all subsequent actions on the session.
>>
>> 3) programming the registers
>>
>> Programming of the PMU registers can occur at any time during the lifetime of a session; the session does not need to be attached to a thread or CPU.
>>
>> It may be necessary to change the settings, e.g., to monitor another event or to reset the counts when sampling at the user level. Thus, the writing of the registers MUST be decoupled from the creation of the session.
>>
>> Similarly, writing of configuration and data registers must also be decoupled, as data registers may be reprogrammed independently of their configuration registers, for instance when sampling.
>>
>> The number of registers varies a lot from one PMU to the other. The relationships between configuration and data registers can be more complex than just one-to-one. On most PMUs, writing of the PMU registers requires running at the most privileged level, i.e., in the kernel. To amortize the cost of a system call, it is interesting to be able to program multiple registers in one call.
>> Thus, it must be possible to pass vector arguments. Of course, for security reasons, the system administrator may impose a limit on how big the vectors can actually be. The advantage is that vectors can vary in size, and thus the amount of data passed between the application and the kernel can be kept to just the minimum needed. System call data needs to be copied into kernel memory before it can be used.
>>
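To make the vector idea concrete, here is a minimal, untested sketch of what this looks like from user level, using the pfm_write_pmcs() call and pfarg_pmc structure described in section 9 below. The header name, the helper function name, and the register numbers/values are purely illustrative placeholders:

    #include <string.h>
    #include <perfmon/perfmon.h>    /* illustrative header name */

    /* program four configuration registers with a single system call,
     * instead of paying the syscall cost once per register */
    static int program_pmcs(int fd)
    {
            struct pfarg_pmc pmcs[4];
            int i;

            memset(pmcs, 0, sizeof(pmcs));  /* clear reserved fields */

            for (i = 0; i < 4; i++) {
                    pmcs[i].reg_num   = i;          /* placeholder register index */
                    pmcs[i].reg_value = 0x1234 + i; /* placeholder, PMU-specific encoding */
            }
            return pfm_write_pmcs(fd, pmcs, 4);     /* one call, vector of 4 elements */
    }

Only the registers the tool actually needs are described in the vector, so the copy into kernel memory stays as small as possible.
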
>> 4) attachment and detachment
>>
>> A session can be attached to a kernel-visible thread or a CPU. If there is attachment, then it must be possible to detach the session, to possibly re-attach it to another thread or CPU. Detachment should not require destroying the session.
>>
>> There are three possibilities for attachment:
>> - when the session is created
>> - when monitoring is activated
>> - with a dedicated call
>>
>> If the attachment is done during the creation of the session, then it means the target (thread or CPU) needs to exist at that time. For a CPU-wide session, this means that the session must be created while executing on that CPU. This does not seem unreasonable, especially on NUMA systems.
>>
>> For a per-thread session, however, this is more problematic, as it means it is not possible to prepare the session and the PMU registers before the thread exists. When monitoring across fork and pthread_create, it is important to minimize overhead. Creation of a session can trigger complex memory allocations in the kernel. Thus, it may be interesting to prepare a batch of ready-to-go sessions, which just need to be attached when the fork or pthread_create notification arrives.
>>
>> If the attachment is coupled with the creation of the session, it implies, by symmetry, that the detachment is coupled with its destruction. Coupling detachment with termination is problematic for both per-thread and CPU-wide mode. With the former, the termination of a thread is usually totally asynchronous with the termination of the session by the monitoring tool. The only case where they are synchronized is for self-monitored threads. When a tool is monitoring a thread in another process, the termination of that thread will cause the kernel to detach the session. But the session must not be closed, because the tool likely wants to read the results and also because the session still exists for the tool. For CPU-wide mode, there is also an issue when a monitored CPU is put off-line dynamically. The session would be detached by the kernel, yet the session would still be live in the tool, whose controlling thread would have been migrated off of that CPU.
>>
>> If the attachment is done when monitoring is activated, then the detachment is done when monitoring is deactivated. The following relationships are therefore enforced:
>>
>>    attached => activated
>>    stopped  => detached
>>
>> It is expected that start/stop operations could be very frequent for self-monitored workloads. When used to monitor small sections of critical code, e.g., loop kernels, it is important to minimize overhead, thus start/stop should be as simple as possible.
>>
>> Attaching requires loading the PMU machine state onto the PMU hardware. Conversely, detaching implies flushing the PMU state to memory, so results can be read even after the termination of a thread, for instance. Both operations are expensive due to the high cost of accessing the PMU registers.
>>
>> Furthermore, there are certain PMU models, e.g., Intel Itanium, where it is possible to let user-level code start/stop monitoring with a single instruction. To minimize overhead, it is very important to allow this mechanism for self-monitored programs. Yet the session would have to be attached/detached somehow. With dedicated attach/detach calls, this can be supported transparently. One possible work-around with the coupled calls would be to require a system call to attach the session and do the initial activation; subsequent start/stop could use the lightweight instruction. The session would be stopped and detached with a system call.
>>
>> The dedicated attach/detach calls offer a maximum level of flexibility. They let applications create sessions in advance or on demand. The actions on the session, start/stop and attach/detach, are perfectly symmetrical. The termination of the monitored target can cause its detachment, but the session remains accessible. Issuing the detach call on a session already detached by the kernel is harmless.
>>
>> The cost of start/stop is not impacted.
>>
>> The following properties are enforced:
>>
>>    upon attachment   => monitoring stopped
>>    during detachment => monitoring stopped
>>
>> 5) start and stop
>>
>> It must be possible for an application to start and stop monitoring at will and at any moment. Start and stop can be called very frequently, and not just at the beginning and end of a session. This is especially likely for self-monitored threads, where it is customary to monitor execution of only one function or loop. Thus those operations can be on the critical path and must therefore be as lightweight as possible. See the discussion in the section about attachment and detachment.
>>
>> 6) reading the results
>>
>> The results are extracted by reading the PMU registers containing data (as opposed to configuration). The number of registers of interest can vary based on the PMU model, the type of measurement, and the events measured.
>>
>> Reading can occur at regular intervals, e.g., for time-based user-level sampling, and can therefore be on the critical path. Thus it must be as lightweight as possible. Given that the cost is dominated by the latency of accessing the PMU registers, it is important to read only the registers that are used. Thus, the call must accept vector arguments, just like the calls to program the PMU.
>>
>> It must be possible to read the registers while the session is detached, but also when it is attached to a thread or CPU.
>>
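The self-monitoring case described in sections 5 and 6 could look roughly like the untested sketch below, using the pfm_start(), pfm_stop() and pfm_read_pmds() calls described in section 9 (same illustrative includes as the earlier sketch, plus <stdio.h>). The session is assumed to be already programmed and attached; run_critical_loop() is a hypothetical function being measured and the data register indexes are placeholders:

    /* measure one critical loop from within the monitored thread itself;
     * fd identifies a session already programmed and attached (section 9) */
    static void measure_loop(int fd)
    {
            struct pfarg_pmd pmds[2];

            memset(pmds, 0, sizeof(pmds));
            pmds[0].reg_num = 4;            /* placeholder data register indexes */
            pmds[1].reg_num = 5;

            pfm_start(fd);                  /* on the critical path: kept minimal */
            run_critical_loop();            /* hypothetical code being measured */
            pfm_stop(fd);

            /* read back only the two registers actually used, in one call */
            if (pfm_read_pmds(fd, pmds, 2) == 0)
                    printf("counts: %llu %llu\n",
                           (unsigned long long)pmds[0].reg_value,
                           (unsigned long long)pmds[1].reg_value);
    }
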
>> 7) termination
>>
>> Termination of a session means all the associated resources are either released to the free pool or destroyed. After termination, no state remains. Termination implies stopping monitoring and detaching the session if necessary.
>>
>> For the purpose of termination, one has to differentiate between the monitored entity and the controlling entity. When a tool monitors a thread in another process, all the threads of the tool are controlling entities, and the monitored thread is the monitored entity. Any entity can vanish at any time.
>>
>> If the monitored entity terminates voluntarily, i.e., a normal exit, or involuntarily, e.g., a core dump, the kernel simply detaches the session, but it is not destroyed.
>>
>> Until the last controlling entity disappears, the session remains accessible.
>>
>> There are situations where all the controlling entities disappear before the monitored entity. In this case, the session becomes useless: results cannot be extracted, so the session enters the zombie state. It will eventually be detached and its resources will be reclaimed by the kernel, i.e., the session will be terminated.
>>
>> 8) extensibility
>>
>> There is already a vast diversity among existing PMU models, and this is unlikely to change; quite the contrary, it is envisioned that the PMU will become a true value-add and that vendors will therefore try to differentiate themselves from one another. Moreover, the PMU will remain closely tied to the underlying micro-architecture. Therefore, it is very important to ensure that the monitoring interface will be able to adapt easily to future PMU models and their extended features, i.e., what is offered beyond counting events.
>>
>> It is important to realize that extensibility is not limited to supporting more PMU registers. It also includes supporting advanced sampling features or socket-level PMUs, as opposed to just core-level PMUs.
>>
>> It may be necessary to extend the system calls with new generic or architecture-specific parameters, and this without simply adding new system calls.
>>
>> 9) current perfmon2 interface
>>
>> The perfmon2 interface design is guided by the principles described in the previous sections. We now explain each call in detail.
>>
>> a) session creation
>>
>>    int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name, void *smpl_arg, size_t arg_size);
>>
>> The function creates the perfmon session and returns a file descriptor used to manipulate the session thereafter.
>>
>> The call takes several parameters, which are as follows:
>> - ctx: encapsulates all session parameters (see below)
>> - smpl_name: used when sampling to designate which format to use
>> - smpl_arg: points to format-specific arguments
>> - arg_size: size of the structure passed in smpl_arg
>>
>> The pfarg_ctx structure is defined as follows:
>> - flags: generic and arch-specific flags for the session
>> - reserved: reserved for future extensions
>>
>> To provide for future extensions, the pfarg_ctx structure contains reserved fields. Reserved fields must be zeroed.
>>
>> To create a per-CPU session, the value PFM_CTX_SYSTEM_WIDE must be passed in flags.
>>
>> When in-kernel sampling is not used, smpl_name, smpl_arg, and arg_size must be 0.
>>
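For illustration, an untested fragment creating one session of each type with this call (includes as in the earlier sketches; error handling omitted):

    struct pfarg_ctx ctx;
    int fd_thread, fd_cpu;

    /* per-thread session: no flags, no in-kernel sampling format */
    memset(&ctx, 0, sizeof(ctx));           /* reserved fields must be zeroed */
    fd_thread = pfm_create_session(&ctx, NULL, NULL, 0);

    /* CPU-wide session: same call, with the system-wide flag set; it will
     * later have to be attached while running on the CPU to monitor */
    memset(&ctx, 0, sizeof(ctx));
    ctx.flags = PFM_CTX_SYSTEM_WIDE;
    fd_cpu = pfm_create_session(&ctx, NULL, NULL, 0);
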
>> b) programming the registers
>>
>>    int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
>>    int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);
>>
>> The calls are provided to program the configuration and data registers, respectively. The parameters are as follows:
>> - fd: file descriptor identifying the session
>> - pmcs: pointer to a vector of pfarg_pmc structures
>> - pmds: pointer to a vector of pfarg_pmd structures
>> - n: number of elements in the pmcs or pmds vector
>>
>> It is possible to pass vectors of pfarg_pmc or pfarg_pmd structures. The minimal size is 1; the maximum size is determined by the system administrator.
>>
>> The pfarg_pmc structure is defined as follows:
>>
>>    struct pfarg_pmc {
>>            u16 reg_num;
>>            u64 reg_value;
>>            u64 reserved[];
>>    };
>>
>> The pfarg_pmd structure is defined as follows:
>>
>>    struct pfarg_pmd {
>>            u16 reg_num;
>>            u64 reg_value;
>>            u64 reserved[];
>>    };
>>
>> Although both structures are currently identical, they will diverge as more functionality is added, so it is better to create two versions from the start.
>>
>> Provisions for extensions are provided by the reserved field in each structure.
>>
>> c) attachment and detachment
>>
>>    int pfm_load_context(int fd, struct pfarg_load *ld);
>>    int pfm_unload_context(int fd);
>>
>> The session is identified by the file descriptor, fd.
>>
>> To attach, the target thread or CPU must be provided. For extensibility purposes, the target is passed in a structure, which is defined as follows:
>>
>>    struct pfarg_load {
>>            u32 target;
>>            u64 reserved[];
>>    };
>>
>> In per-thread mode, the target field must be set to the kernel thread identification (gettid()).
>>
>> In per-CPU mode, the target field must be set to the logical CPU identification as seen by the kernel. Furthermore, the caller must be running on the CPU to monitor, otherwise the call fails.
>>
>> Extensions can be implemented using the reserved field.
>>
>> d) start and stop
>>
>>    int pfm_start(int fd);
>>    int pfm_stop(int fd);
>>
>> The session is identified by the file descriptor fd.
>>
>> Currently no other parameters are supported for those calls.
>>
>> e) reading results
>>
>>    int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);
>>
>> The session is identified by the file descriptor fd.
>>
>> Just like for programming the registers, it is possible to pass vectors of structures in pmds. The number of elements is passed in n.
>>
>> f) termination
>>
>>    int close(fd);
>>
>> To terminate a session, the file descriptor has to be closed. The semantics of file descriptor sharing apply, so if another reference to the session, i.e., another file descriptor, exists, the session will only be effectively destroyed once that reference disappears.
>>
>> Of course, the kernel closes all file descriptors on process termination, thus the associated sessions will eventually be destroyed.
>>
>> In per-CPU mode, it is not necessary, though recommended, to be running on the monitored CPU to issue this call.
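Finally, to tie the session breakdown of section 1 to the calls above, here is an untested end-to-end sketch of a self-monitoring per-thread session. The header name, register numbers and event encoding are illustrative placeholders, gettid() is obtained via syscall() since glibc may not wrap it, and error handling is omitted:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <perfmon/perfmon.h>    /* illustrative header name */

    int main(void)
    {
            struct pfarg_ctx  ctx;
            struct pfarg_pmc  pmc;
            struct pfarg_pmd  pmd;
            struct pfarg_load load;
            int fd;

            /* 1. create the session */
            memset(&ctx, 0, sizeof(ctx));
            fd = pfm_create_session(&ctx, NULL, NULL, 0);

            /* 2. program one config and one data register (placeholders) */
            memset(&pmc, 0, sizeof(pmc));
            memset(&pmd, 0, sizeof(pmd));
            pmc.reg_num   = 0;
            pmc.reg_value = 0x1234;         /* PMU-specific event encoding */
            pmd.reg_num   = 0;
            pmd.reg_value = 0;              /* start counting from zero */
            pfm_write_pmcs(fd, &pmc, 1);
            pfm_write_pmds(fd, &pmd, 1);

            /* 3. attach to the calling thread */
            memset(&load, 0, sizeof(load));
            load.target = syscall(SYS_gettid);   /* kernel thread id */
            pfm_load_context(fd, &load);

            /* 4. start, run the measured code, stop */
            pfm_start(fd);
            /* ... code being measured ... */
            pfm_stop(fd);

            /* 5. read the result */
            pfm_read_pmds(fd, &pmd, 1);
            printf("count: %llu\n", (unsigned long long)pmd.reg_value);

            /* 6. detach and terminate */
            pfm_unload_context(fd);
            close(fd);
            return 0;
    }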