Re: [perfmon2] perfmon2 syscall interface rationale

Corey J Ashford Tue, 01 Jul 2008 10:00:12 -0700

Hi Stephane,

You might want to add a separate section that details the thinking about 
why you don't want to use a single, multiplexed syscall.  If you add this, 
it could go after you've detailed the session breakdown, and before you've 
described the current syscalls.  I know this has been an area some of the 
LKML folks have picked at before.


I will read this in more detail later.

- Corey

[EMAIL PROTECTED] wrote on 07/01/2008 09:41:36 AM:

> Hello everyone,
> 
> I intend to send the following description to LKML and a few LKML
> developers to try and explain the reasoning behind the current syscall
> interface for perfmon2.
> 
> I know there have been a lot of doubts and misunderstandings as to why
> we need so many syscalls and how they could be extended. I tried to
> address those concerns here.
> 
> Please feel free to comment, add to it.
> 
> Thanks.
> 
> 
-----------------------------------------------------------------------------------------------------------------------
> 
> 1) monitoring session breakdown
> 
> A monitoring session can be decomposed into a sequence of fundamental
> actions which
> are as follows:
>        - create the session
>        - program registers
>        - attach to target thread or CPU
>        - start monitoring
>        - stop monitoring
>        - read results
>        - detach from thread or CPU
>        - terminate session
> 
> The order need not be exactly as shown. For instance, the programming
> may happen after the session has been attached. Obviously, the
> start/stop operations may be repeated before results are read, and
> results can be read multiple times.
> 
> In the next sections, we examine each action separately.
> 
> 2) session creation
> 
>   Perfmon2 supports 2 types of sessions: per-thread or per-CPU
>   (so-called system-wide).
> 
>   During the creation of the session, certain attributes are set; they
>   remain until the session is terminated. For instance, the per-CPU
>   attribute cannot be changed.
> 
>   During creation, the kernel state to support the session is allocated
>   and initialized. No PMU hardware is actually accessed. Permissions to
>   create a session may be checked. Resource limits are also validated
>   and memory consumption is accounted for.
> 
>   The software state of the PMU is initialized, i.e., all
> configuration registers are
>   set to a quiescent value. Data registers are initialized to zero
> whenever possible.
> 
>   Upon return, the kernel returns a unique identifier which is to be
> used for all
>   subsequent actions on the session.
> 
> 3) programming the registers
> 
>   Programming of the PMU registers can occur at any time during the
>   lifetime of a session; the session does not need to be attached to a
>   thread or CPU.
> 
>   It may be necessary to change the settings, e.g., monitor another
> event or reset the counts
>   when sampling at the user level. Thus, the writing of the registers
> MUST be decoupled from
>   the creation of the session.
> 
>   Similarly, writing of configuration and data registers must also be
> decoupled, as data
>   registers may be reprogrammed independently of their configuration
> registers, like when
>   sampling for instance.
> 
>   The number of registers varies a lot from one PMU to the other. The
>   relationships between configuration and data registers can be more
>   complex than just one-to-one. On most PMUs, writing the PMU registers
>   requires running at the most privileged level, i.e., in the kernel.
>   To amortize the cost of a system call, it is useful to be able to
>   program multiple registers in one call. Thus, it must be possible to
>   pass vector arguments. Of course, for security reasons, the system
>   administrator may impose a limit on how big vectors can actually be.
>   The advantage is that vectors can vary in size, and thus the amount
>   of data passed between application and kernel can be kept to the
>   minimum needed. System call data needs to be copied into the kernel
>   memory space before it can be used.
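
To make the vector-argument point concrete, here is a minimal user-space
sketch of the validate-then-copy step. Everything in it is an
assumption made for illustration: struct reg_arg, copy_reg_vector() and
the max_n limit are hypothetical names, and memcpy() merely stands in
for the kernel's copy_from_user().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register descriptor; not the real pfarg_pmc layout. */
struct reg_arg {
    uint16_t reg_num;
    uint64_t reg_value;
};

/*
 * Validate a caller-supplied vector and copy it into a kernel-side
 * buffer. max_n stands in for the administrator-imposed limit, and
 * kbuf must have room for max_n elements.
 * Returns the number of bytes copied, or a negative errno value.
 */
long copy_reg_vector(struct reg_arg *kbuf, const struct reg_arg *ubuf,
                     int n, int max_n)
{
    if (ubuf == NULL || n < 1)
        return -EINVAL;          /* minimal vector size is 1 */
    if (n > max_n)
        return -E2BIG;           /* larger than the configured limit */

    size_t bytes = (size_t)n * sizeof(*ubuf);
    memcpy(kbuf, ubuf, bytes);   /* stand-in for copy_from_user() */
    return (long)bytes;
}
```

Only n elements cross the boundary, so a tool that programs two
registers pays for two, whatever the size of the PMU.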
> 
> 4) attachment and detachment
> 
>  A session can be attached to a kernel-visible thread or a CPU. If
> there is attachment,
>  then it must be possible to detach the session to possibly re-attach
> it to another thread
>  or CPU. Detachment should not require destroying the session.
> 
>  There are 3 possibilities for attachment:
>        - when the session is created
>        - when the monitoring is activated
>        - with a dedicated call
> 
>  If the attachment is done during the creation of the session, then it
> means the target (thread or CPU)
>  needs to exist at that time. For a cpu-wide session, this means that
> the session must be created while
>  executing on that CPU. This does not seem unreasonable especially on
> NUMA systems.
> 
>  For a per-thread session however, this is a bit more problematic as
> this means it is not possible
>  to prepare the session and the PMU registers before the thread
> exists. When monitoring across fork
>  and pthread_create, it is important to minimize overhead. Creation of
> a session can trigger complex
>  memory allocations in the kernel. Thus, it may be interesting to
> prepare a batch of ready-to-go sessions,
>  which just need to be attached when the fork or pthread_create
> notification arrives.
> 
>  If the attachment is coupled with the creation of the session, it
> implies that the detachment is coupled
>  with its destruction, by symmetry. Coupling of detachment with
> termination is problematic for both per-thread
>  and CPU-wide mode. With the former, the termination of a thread is
> usually totally asynchronous with the
>  termination of the session by the monitoring tool. The only case
> where they are synchronized is for
>  self-monitored threads. When a tool is monitoring a thread in another
> process, the termination of that thread
>  will cause the kernel to detach the session. But the session must not
> be closed because the tool likely wants
>  to read the results and also because the session still exists for the
> tool. For CPU-wide, there is also an issue
>  when a monitored CPU is put off-line dynamically. The session would
> be detached by the kernel, yet the session would
>  still be live in the tool whose controlling thread would have been
> migrated off of that CPU.
> 
>  If the attachment is done when monitoring is activated, then the
> detachment is done when monitoring
>  is deactivated. The following relationships are therefore enforced:
> 
>        attached => activated
>        stopped  => detached
> 
>  It is expected that start/stop operations could be very frequent for
> self-monitored workloads. When used
>  to monitor small sections of critical code, e.g., loop kernels, it is
> important to minimize overhead, thus
>  the start/stop should be as simple as possible.
> 
>  Attaching requires loading the PMU machine state onto the PMU
> hardware. Conversely, detaching implies flushing
>  the PMU state to memory so results can be read even after the
> termination of a thread, for instance.  Both
>  operations are expensive due to the high cost of accessing the PMU
>  registers.
> 
>  Furthermore, there are certain PMU models, e.g., Intel Itanium, where
> it is possible to let user level code
>  start/stop monitoring with a single instruction. To minimize
> overhead, it is very important to allow this
>  mechanism for self-monitored programs. Yet the session would have to
> be attached/detached somehow. With
>  dedicated attach/detach calls, this can be supported transparently.
>  One possible work-around with the coupled calls would be to require
>  a system call to attach the session and do the initial activation;
>  subsequent start/stop could use the lightweight instruction. The
>  session would be stopped and detached with a system call.
> 
>  The dedicated attach/detach calls offer a maximum level of
>  flexibility. They let applications create sessions in advance or
>  on-demand. The actions on the session, start/stop and attach/detach,
>  are perfectly symmetrical. The termination of the monitored target
>  can cause its detachment, but the session remains accessible. Issuing
>  the detach call on a session already detached by the kernel is
>  harmless.
> 
>  The cost of start/stop is not impacted.
> 
>  The following properties are enforced:
>        upon attachment   => monitoring stopped
>        during detachment => monitoring stopped
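
The two properties above can be sketched as a tiny state machine. This
is only an illustration: struct session_state and the four functions
are hypothetical names, and the error codes for starting a detached
session or attaching twice are my assumptions, not something the
description specifies.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical session state; names are illustrative only. */
struct session_state {
    bool attached;   /* bound to a thread or CPU */
    bool running;    /* monitoring is active */
};

int session_attach(struct session_state *s)
{
    if (s->attached)
        return -EBUSY;       /* assumption: double attach is refused */
    s->attached = true;
    s->running = false;      /* upon attachment => monitoring stopped */
    return 0;
}

int session_start(struct session_state *s)
{
    if (!s->attached)
        return -EINVAL;      /* assumption: nothing to monitor yet */
    s->running = true;
    return 0;
}

int session_stop(struct session_state *s)
{
    s->running = false;      /* session stays attached */
    return 0;
}

int session_detach(struct session_state *s)
{
    s->running = false;      /* during detachment => monitoring stopped */
    s->attached = false;     /* harmless if already detached */
    return 0;
}
```

Note that start and stop only flip one flag, which matches the claim
that their cost is not impacted by having separate attach/detach calls.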
> 
> 5) start and stop
> 
>  It must be possible for an application to start and stop monitoring
> at will and at any moment.
>  Start and stop can be called very frequently and not just at the
> beginning and end of a session.
>  This is especially likely for self-monitored threads where it is
> customary to monitor execution of
>  only one function or loop. Thus those operations can be on the
>  critical path and they must therefore be as lightweight as possible.
>  See the discussion in the section about attachment and detachment.
> 
> 
> 6) reading the results
> 
>  The results are extracted by reading the PMU registers containing
> data (as opposed to configuration).
>  The number of registers of interest can vary based on the PMU model,
> the type of measurement, the events
>  measured.
> 
>  Reading can occur at regular intervals, e.g., time-based user level
>  sampling, and can therefore be on the critical path. Thus it must be
>  as lightweight as possible. Given that the cost is dominated by the
>  latency of accessing the PMU registers, it is important to only read
>  the registers that are used. Thus, the call must provide vector
>  arguments just like the calls to program the PMU.
> 
>  It must be possible to read the registers while the session is
> detached but also when it is attached to a
>  thread or CPU.
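
As a rough sketch of "only read the registers that are used": the
caller names the registers it wants and only those are touched. The
names here (struct pmd_arg, read_pmd_vector(), NUM_PMDS) are
hypothetical, and the pmd_state array stands in for either the live
PMU or the saved software state of a detached session.

```c
#include <errno.h>
#include <stdint.h>

#define NUM_PMDS 8   /* hypothetical number of data registers */

/* Hypothetical request element: caller fills reg_num, gets reg_value. */
struct pmd_arg {
    uint16_t reg_num;
    uint64_t reg_value;
};

/*
 * Fill in reg_value for each requested register. The whole vector is
 * validated first so that a bad entry leaves the caller's data intact.
 */
int read_pmd_vector(const uint64_t *pmd_state, struct pmd_arg *v, int n)
{
    if (n < 1)
        return -EINVAL;
    for (int i = 0; i < n; i++)
        if (v[i].reg_num >= NUM_PMDS)
            return -EINVAL;
    for (int i = 0; i < n; i++)
        v[i].reg_value = pmd_state[v[i].reg_num];
    return 0;
}
```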
> 
> 7) termination
> 
>  Termination of a session means all the associated resources are
> either released to the free pool or destroyed.
>  After termination, no state remains. Termination implies stopping
>  monitoring and detaching the session if necessary.
> 
>  For the purpose of termination, one has to differentiate between the
> monitored entity and the controlling entity.
>  When a tool monitors a thread in another process, all the threads
> from the tool are controlling entities, and the
>  monitored thread is the monitored entity. Any entity can vanish at
>  any time.
> 
>  If the monitored entity terminates voluntarily, i.e., normal exit, or
> involuntarily, e.g., core dump, the kernel
>  simply detaches the session but it is not destroyed.
> 
>  Until the last controlling entity disappears, the session remains
>  accessible.
> 
>  There are situations where all the controlling entities disappear
> before the monitored entity. In this case, the
>  session becomes useless, results cannot be extracted, thus the
> session enters the zombie state. It will
>  eventually be detached and its resources will be reclaimed by the
> kernel, i.e., the session will be terminated.
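
The termination rules can be modelled with a reference count on the
controlling side. This is a sketch under stated assumptions: struct
lifetime and both functions are hypothetical, and I assume the zombie
is reclaimed when the monitored entity exits, whereas the description
only says it is "eventually" detached and reclaimed.

```c
#include <stdbool.h>

enum sess_state { SESS_LIVE, SESS_ZOMBIE, SESS_DESTROYED };

/* Hypothetical bookkeeping for one session. */
struct lifetime {
    int controllers;        /* references held by controlling entities */
    bool attached;          /* still attached to a monitored entity */
    enum sess_state state;
};

/* Monitored entity exits: detach only. A zombie can now be reclaimed. */
void monitored_exit(struct lifetime *s)
{
    s->attached = false;
    if (s->state == SESS_ZOMBIE)
        s->state = SESS_DESTROYED;
}

/* A controlling entity drops its reference (close() or its own exit). */
void controller_release(struct lifetime *s)
{
    if (s->controllers > 0 && --s->controllers > 0)
        return;             /* someone can still read the results */
    /* Last controller gone: results are unreachable from here on. */
    s->state = s->attached ? SESS_ZOMBIE : SESS_DESTROYED;
}
```

With two controllers, the target exiting and one controller closing
leaves the session live; it is destroyed only when the second one goes.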
> 
> 8) extensibility
> 
>   There is already a vast diversity among existing PMU models, and this
>   is unlikely to change. On the contrary, it is envisioned that the PMU
>   will become a true value-add and that vendors will therefore try to
>   differentiate themselves from one another. Moreover, the PMU will
>   remain closely tied to the underlying micro-architecture. Therefore,
>   it is very important to ensure that the monitoring interface will be
>   able to adapt easily to future PMU models and their extended
>   features, i.e., what is offered beyond counting events.
> 
>   It is important to realize that extensibility is not limited to
> supporting more PMU registers. It also includes
>   supporting advanced sampling features or socket-level PMUs as
> opposed to just core-level PMUs.
> 
>   It may be necessary to extend the system calls with new generic or
> architecture specific parameters, and this
>   without simply adding new system calls.
> 
> 9) current perfmon2 interface
> 
>   The perfmon2 interface design is guided by the principles described
>   in the previous sections. We now explain each call in detail.
> 
> 
>   a) session creation
> 
>      int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name,
> void *smpl_arg, size_t arg_size);
> 
>      The function creates the perfmon session and returns a file
> descriptor used to manipulate the session
>      thereafter.
> 
>      The call takes several parameters, which are as follows:
>         - ctx:       encapsulates all session parameters (see below)
>         - smpl_name: used when sampling to designate which format to use
>         - smpl_arg:  points to format-specific arguments
>         - arg_size:  size of the structure passed in smpl_arg
> 
>      The pfarg_ctx structure is defined as follows:
>         - flags: generic and arch-specific flags for the session
>         - reserved: reserved for future extensions
> 
>      To provide for future extensions, the pfarg_ctx structure
> contains reserved fields. Reserved fields
>      must be zeroed.
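
One way to read "reserved fields must be zeroed" is that the kernel
rejects anything it does not understand, so that a later version can
give those bits a meaning without breaking old binaries. A minimal
sketch of such a check follows. The layout is hypothetical: the
description only mentions flags and reserved fields, so the reserved
array size, CTX_KNOWN_FLAGS and ctx_arg_validate() are my inventions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the reserved size is chosen for illustration. */
struct ctx_arg {
    uint32_t flags;
    uint64_t reserved[4];
};

/* Hypothetical: the only flag bit this kernel version understands. */
#define CTX_KNOWN_FLAGS 0x1u

/* Return 0 if the argument uses only features this version knows. */
int ctx_arg_validate(const struct ctx_arg *c)
{
    if (c->flags & ~CTX_KNOWN_FLAGS)
        return -EINVAL;      /* unknown flag bit set */
    for (size_t i = 0; i < sizeof(c->reserved) / sizeof(c->reserved[0]); i++)
        if (c->reserved[i] != 0)
            return -EINVAL;  /* caller expects an extension we lack */
    return 0;
}
```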
> 
>      To create a per-cpu session, the value PFM_CTX_SYSTEM_WIDE must
> be passed in flags.
> 
>      When in-kernel sampling is not used, smpl_name, smpl_arg, and
>      arg_size must be 0.
> 
>   b) programming the registers
> 
>      int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
>      int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);
> 
>      The calls are provided to program the configuration and data
>      registers respectively. The parameters are as follows:
>         - fd: file descriptor identifying the session
>         - pmcs: pointer to pfarg_pmc structures
>         - pmds: pointer to pfarg_pmd structures
>         - n: number of elements in the pmcs or pmds vector
> 
>      It is possible to pass a vector of pfarg_pmc or pfarg_pmd
>      structures. The minimal size is 1; the maximum size is determined
>      by the system administrator.
> 
>      The pfarg_pmc structure is defined as follows:
>      struct pfarg_pmc {
>         u16 reg_num;
>         u64 reg_value;
>         u64 reserved[];
>      };
> 
>      The pfarg_pmd structure is defined as follows:
>      struct pfarg_pmd {
>         u16 reg_num;
>         u64 reg_value;
>         u64 reserved[];
>      };
> 
>      Although both structures are currently identical, they will
>      differ as more functionality is added, so it is better to define
>      two separate structures from the start.
> 
>      Provisions for extensions are provided by the reserved field in
> each structure.
> 
> 
>   c) attachment and detachment
> 
>      int pfm_load_context(int fd, struct pfarg_load *ld);
>      int pfm_unload_context(int fd);
> 
> 
>      The session is identified by the file descriptor, fd.
> 
>      To attach, the targeted thread or CPU must be provided. For
>      extensibility purposes, the target is passed in a structure which
>      is defined as follows:
>      struct pfarg_load {
>         u32 target;
>         u64 reserved[];
>      };
>      In per-thread mode, the target field must be set to the kernel
> thread identification (gettid()).
> 
>      In per-cpu mode, the target field must be set to the logical CPU
>      identification as seen by the kernel. Furthermore, the caller must
>      be running on the CPU to monitor; otherwise the call fails.
> 
>      Extensions can be implemented using the reserved field.
> 
> 
>   d) start and stop
> 
>      int pfm_start(int fd);
>      int pfm_stop(int fd);
> 
>      The session is identified by the file descriptor fd.
> 
>      Currently no other parameters are supported for those calls.
> 
> 
>    e) reading results
> 
>      int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);
> 
> 
>      The session is identified by the file descriptor fd.
> 
>      Just like for programming the registers, it is possible to pass
> vectors of structures in pmds. The number
>      of elements is passed in n.
> 
> 
>    f) termination
> 
>      int close(int fd);
> 
>      To terminate a session, the file descriptor has to be closed. The
>      semantics of file descriptor sharing apply, so if another
>      reference to the session, i.e., another file descriptor, exists,
>      the session will only be effectively destroyed once that reference
>      disappears.
> 
>      Of course, the kernel closes all file descriptors on process
>      termination, thus the associated sessions will eventually be
>      destroyed.
> 
>      In per-cpu mode, it is not necessary, though recommended, to be
> on the monitored CPU to issue this call.
> 
> 
-------------------------------------------------------------------------
> Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
> Studies have shown that voting for your favorite open source project,
> along with a healthy diet, reduces your potential for chronic lameness
> and boredom. Vote Now at http://www.sourceforge.net/community/cca08
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel


