Corey,

On Tue, Jul 1, 2008 at 6:58 PM, Corey J Ashford <[EMAIL PROTECTED]> wrote:
> You might want to add a separate section that details the thinking about
> why you don't want to use a single, multiplexed syscall. If you add this,
> it could go after you've detailed the session breakdown, and before you've
> described the current syscalls. I know this has been an area some of the
> LKML folks have picked at before.
>
That's a good point. I will add a paragraph about this. But I was thinking that they would probably not oppose a mix of the two: several syscalls, including one with multiplexing. For instance, I could envision a pfm_controls() which could be used to start/stop and attach/detach, and leave the other calls untouched.

> [EMAIL PROTECTED] wrote on 07/01/2008 09:41:36 AM:
>
>> Hello everyone,
>>
>> I intend to send the following description to LKML and a few LKML developers to try and explain the reasoning behind the current syscall interface for perfmon2.
>>
>> I know there have been a lot of doubts and misunderstandings as to why we need so many syscalls and how they could be extended. I tried to address those concerns here.
>>
>> Please feel free to comment or add to it.
>>
>> Thanks.
>>
>> -----------------------------------------------------------------------
>>
>> 1) monitoring session breakdown
>>
>> A monitoring session can be decomposed into a sequence of fundamental actions, which are as follows:
>> - create the session
>> - program registers
>> - attach to target thread or CPU
>> - start monitoring
>> - stop monitoring
>> - read results
>> - detach from thread or CPU
>> - terminate session
>>
>> The order is not necessarily as shown. For instance, the programming may happen after the session has been attached. Obviously, the start/stop operations may be repeated before results are read, and results can be read multiple times.
>>
>> In the next sections, we examine each action separately.
>>
>> 2) session creation
>>
>> Perfmon2 supports two types of sessions: per-thread and per-CPU (so-called system-wide).
>>
>> During the creation of the session, certain attributes are set; they remain fixed until the session is terminated. For instance, the per-CPU attribute cannot be changed.
>>
>> During creation, the kernel state to support the session is allocated and initialized. No PMU hardware is actually accessed. Permissions to create a session may be checked. Resource limits are also validated and memory consumption is accounted for.
>>
>> The software state of the PMU is initialized, i.e., all configuration registers are set to a quiescent value. Data registers are initialized to zero whenever possible.
>>
>> Upon return, the kernel provides a unique identifier which is to be used for all subsequent actions on the session.
>>
>> 3) programming the registers
>>
>> Programming of the PMU registers can occur at any time during the lifetime of a session; the session does not need to be attached to a thread or CPU.
>>
>> It may be necessary to change the settings, e.g., to monitor another event or to reset the counts when sampling at the user level. Thus, the writing of the registers MUST be decoupled from the creation of the session.
>>
>> Similarly, writing of configuration and data registers must also be decoupled, as data registers may be reprogrammed independently of their configuration registers, for instance when sampling.
>>
>> The number of registers varies a lot from one PMU to the other. The relationships between configuration and data registers can be more complex than just one-to-one. On most PMUs, writing of the PMU registers requires running at the most privileged level, i.e., in the kernel. To amortize the cost of a system call, it is interesting to be able to program multiple registers in one call.
>> Thus, it must be possible to pass vector arguments. Of course, for security reasons, the system administrator may impose a limit on how big the vectors can actually be. The advantage is that vectors can vary in size, and thus the amount of data passed between the application and the kernel can be kept to just the minimum needed. System call data needs to be copied into kernel memory before it can be used.
>>
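To make the vector idea concrete, here is a minimal, untested sketch of what this looks like from user level, using the pfm_write_pmcs() call and pfarg_pmc structure described in section 9 below. The header name, the helper function name, and the register numbers/values are purely illustrative placeholders:

    #include <string.h>
    #include <perfmon/perfmon.h>    /* illustrative header name */

    /* program four configuration registers with a single system call,
     * instead of paying the syscall cost once per register */
    static int program_pmcs(int fd)
    {
            struct pfarg_pmc pmcs[4];
            int i;

            memset(pmcs, 0, sizeof(pmcs));  /* clear reserved fields */

            for (i = 0; i < 4; i++) {
                    pmcs[i].reg_num   = i;          /* placeholder register index */
                    pmcs[i].reg_value = 0x1234 + i; /* placeholder, PMU-specific encoding */
            }
            return pfm_write_pmcs(fd, pmcs, 4);     /* one call, vector of 4 elements */
    }

Only the registers the tool actually needs are described in the vector, so the copy into kernel memory stays as small as possible.
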
>> 4) attachment and detachment
>>
>> A session can be attached to a kernel-visible thread or a CPU. If there is attachment, then it must be possible to detach the session, to possibly re-attach it to another thread or CPU. Detachment should not require destroying the session.
>>
>> There are three possibilities for attachment:
>> - when the session is created
>> - when monitoring is activated
>> - with a dedicated call
>>
>> If the attachment is done during the creation of the session, then it means the target (thread or CPU) needs to exist at that time. For a CPU-wide session, this means that the session must be created while executing on that CPU. This does not seem unreasonable, especially on NUMA systems.
>>
>> For a per-thread session, however, this is more problematic, as it means it is not possible to prepare the session and the PMU registers before the thread exists. When monitoring across fork and pthread_create, it is important to minimize overhead. Creation of a session can trigger complex memory allocations in the kernel. Thus, it may be interesting to prepare a batch of ready-to-go sessions, which just need to be attached when the fork or pthread_create notification arrives.
>>
>> If the attachment is coupled with the creation of the session, it implies, by symmetry, that the detachment is coupled with its destruction. Coupling detachment with termination is problematic for both per-thread and CPU-wide mode. With the former, the termination of a thread is usually totally asynchronous with the termination of the session by the monitoring tool. The only case where they are synchronized is for self-monitored threads. When a tool is monitoring a thread in another process, the termination of that thread will cause the kernel to detach the session. But the session must not be closed, because the tool likely wants to read the results and also because the session still exists for the tool. For CPU-wide mode, there is also an issue when a monitored CPU is put off-line dynamically. The session would be detached by the kernel, yet the session would still be live in the tool, whose controlling thread would have been migrated off of that CPU.
>>
>> If the attachment is done when monitoring is activated, then the detachment is done when monitoring is deactivated. The following relationships are therefore enforced:
>>
>>    attached => activated
>>    stopped  => detached
>>
>> It is expected that start/stop operations could be very frequent for self-monitored workloads. When used to monitor small sections of critical code, e.g., loop kernels, it is important to minimize overhead, thus start/stop should be as simple as possible.
>>
>> Attaching requires loading the PMU machine state onto the PMU hardware. Conversely, detaching implies flushing the PMU state to memory, so results can be read even after the termination of a thread, for instance. Both operations are expensive due to the high cost of accessing the PMU registers.
>>
>> Furthermore, there are certain PMU models, e.g., Intel Itanium, where it is possible to let user-level code start/stop monitoring with a single instruction. To minimize overhead, it is very important to allow this mechanism for self-monitored programs. Yet the session would have to be attached/detached somehow. With dedicated attach/detach calls, this can be supported transparently. One possible work-around with the coupled calls would be to require a system call to attach the session and do the initial activation; subsequent start/stop could use the lightweight instruction. The session would be stopped and detached with a system call.
>>
>> The dedicated attach/detach calls offer a maximum level of flexibility. They let applications create sessions in advance or on demand. The actions on the session, start/stop and attach/detach, are perfectly symmetrical. The termination of the monitored target can cause its detachment, but the session remains accessible. Issuing the detach call on a session already detached by the kernel is harmless.
>>
>> The cost of start/stop is not impacted.
>>
>> The following properties are enforced:
>>
>>    upon attachment   => monitoring stopped
>>    during detachment => monitoring stopped
>>
>> 5) start and stop
>>
>> It must be possible for an application to start and stop monitoring at will and at any moment. Start and stop can be called very frequently, and not just at the beginning and end of a session. This is especially likely for self-monitored threads, where it is customary to monitor execution of only one function or loop. Thus those operations can be on the critical path and must therefore be as lightweight as possible. See the discussion in the section about attachment and detachment.
>>
>> 6) reading the results
>>
>> The results are extracted by reading the PMU registers containing data (as opposed to configuration). The number of registers of interest can vary based on the PMU model, the type of measurement, and the events measured.
>>
>> Reading can occur at regular intervals, e.g., for time-based user-level sampling, and can therefore be on the critical path. Thus it must be as lightweight as possible. Given that the cost is dominated by the latency of accessing the PMU registers, it is important to read only the registers that are used. Thus, the call must accept vector arguments, just like the calls to program the PMU.
>>
>> It must be possible to read the registers while the session is detached, but also when it is attached to a thread or CPU.
>>
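The self-monitoring case described in sections 5 and 6 could look roughly like the untested sketch below, using the pfm_start(), pfm_stop() and pfm_read_pmds() calls described in section 9 (same illustrative includes as the earlier sketch, plus <stdio.h>). The session is assumed to be already programmed and attached; run_critical_loop() is a hypothetical function being measured and the data register indexes are placeholders:

    /* measure one critical loop from within the monitored thread itself;
     * fd identifies a session already programmed and attached (section 9) */
    static void measure_loop(int fd)
    {
            struct pfarg_pmd pmds[2];

            memset(pmds, 0, sizeof(pmds));
            pmds[0].reg_num = 4;            /* placeholder data register indexes */
            pmds[1].reg_num = 5;

            pfm_start(fd);                  /* on the critical path: kept minimal */
            run_critical_loop();            /* hypothetical code being measured */
            pfm_stop(fd);

            /* read back only the two registers actually used, in one call */
            if (pfm_read_pmds(fd, pmds, 2) == 0)
                    printf("counts: %llu %llu\n",
                           (unsigned long long)pmds[0].reg_value,
                           (unsigned long long)pmds[1].reg_value);
    }
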
>> 7) termination
>>
>> Termination of a session means all the associated resources are either released to the free pool or destroyed. After termination, no state remains. Termination implies stopping monitoring and detaching the session if necessary.
>>
>> For the purpose of termination, one has to differentiate between the monitored entity and the controlling entity. When a tool monitors a thread in another process, all the threads of the tool are controlling entities, and the monitored thread is the monitored entity. Any entity can vanish at any time.
>>
>> If the monitored entity terminates voluntarily, i.e., a normal exit, or involuntarily, e.g., a core dump, the kernel simply detaches the session, but it is not destroyed.
>>
>> Until the last controlling entity disappears, the session remains accessible.
>>
>> There are situations where all the controlling entities disappear before the monitored entity. In this case, the session becomes useless: results cannot be extracted, so the session enters the zombie state. It will eventually be detached and its resources will be reclaimed by the kernel, i.e., the session will be terminated.
>>
>> 8) extensibility
>>
>> There is already a vast diversity among existing PMU models, and this is unlikely to change; quite the contrary, it is envisioned that the PMU will become a true value-add and that vendors will therefore try to differentiate themselves from one another. Moreover, the PMU will remain closely tied to the underlying micro-architecture. Therefore, it is very important to ensure that the monitoring interface will be able to adapt easily to future PMU models and their extended features, i.e., what is offered beyond counting events.
>>
>> It is important to realize that extensibility is not limited to supporting more PMU registers. It also includes supporting advanced sampling features or socket-level PMUs, as opposed to just core-level PMUs.
>>
>> It may be necessary to extend the system calls with new generic or architecture-specific parameters, and this without simply adding new system calls.
>>
>> 9) current perfmon2 interface
>>
>> The perfmon2 interface design is guided by the principles described in the previous sections. We now explain each call in detail.
>>
>> a) session creation
>>
>>    int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name, void *smpl_arg, size_t arg_size);
>>
>> The function creates the perfmon session and returns a file descriptor used to manipulate the session thereafter.
>>
>> The call takes several parameters, which are as follows:
>> - ctx: encapsulates all session parameters (see below)
>> - smpl_name: used when sampling to designate which format to use
>> - smpl_arg: points to format-specific arguments
>> - arg_size: size of the structure passed in smpl_arg
>>
>> The pfarg_ctx structure is defined as follows:
>> - flags: generic and arch-specific flags for the session
>> - reserved: reserved for future extensions
>>
>> To provide for future extensions, the pfarg_ctx structure contains reserved fields. Reserved fields must be zeroed.
>>
>> To create a per-CPU session, the value PFM_CTX_SYSTEM_WIDE must be passed in flags.
>>
>> When in-kernel sampling is not used, smpl_name, smpl_arg, and arg_size must be 0.
>>
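For illustration, an untested fragment creating one session of each type with this call (includes as in the earlier sketches; error handling omitted):

    struct pfarg_ctx ctx;
    int fd_thread, fd_cpu;

    /* per-thread session: no flags, no in-kernel sampling format */
    memset(&ctx, 0, sizeof(ctx));           /* reserved fields must be zeroed */
    fd_thread = pfm_create_session(&ctx, NULL, NULL, 0);

    /* CPU-wide session: same call, with the system-wide flag set; it will
     * later have to be attached while running on the CPU to monitor */
    memset(&ctx, 0, sizeof(ctx));
    ctx.flags = PFM_CTX_SYSTEM_WIDE;
    fd_cpu = pfm_create_session(&ctx, NULL, NULL, 0);
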
>> b) programming the registers
>>
>>    int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
>>    int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);
>>
>> The calls are provided to program the configuration and data registers, respectively. The parameters are as follows:
>> - fd: file descriptor identifying the session
>> - pmcs: pointer to a vector of pfarg_pmc structures
>> - pmds: pointer to a vector of pfarg_pmd structures
>> - n: number of elements in the pmcs or pmds vector
>>
>> It is possible to pass vectors of pfarg_pmc or pfarg_pmd structures. The minimal size is 1; the maximum size is determined by the system administrator.
>>
>> The pfarg_pmc structure is defined as follows:
>>
>>    struct pfarg_pmc {
>>            u16 reg_num;
>>            u64 reg_value;
>>            u64 reserved[];
>>    };
>>
>> The pfarg_pmd structure is defined as follows:
>>
>>    struct pfarg_pmd {
>>            u16 reg_num;
>>            u64 reg_value;
>>            u64 reserved[];
>>    };
>>
>> Although both structures are currently identical, they will diverge as more functionality is added, so it is better to create two versions from the start.
>>
>> Provisions for extensions are provided by the reserved field in each structure.
>>
>> c) attachment and detachment
>>
>>    int pfm_load_context(int fd, struct pfarg_load *ld);
>>    int pfm_unload_context(int fd);
>>
>> The session is identified by the file descriptor, fd.
>>
>> To attach, the target thread or CPU must be provided. For extensibility purposes, the target is passed in a structure, which is defined as follows:
>>
>>    struct pfarg_load {
>>            u32 target;
>>            u64 reserved[];
>>    };
>>
>> In per-thread mode, the target field must be set to the kernel thread identification (gettid()).
>>
>> In per-CPU mode, the target field must be set to the logical CPU identification as seen by the kernel. Furthermore, the caller must be running on the CPU to monitor, otherwise the call fails.
>>
>> Extensions can be implemented using the reserved field.
>>
>> d) start and stop
>>
>>    int pfm_start(int fd);
>>    int pfm_stop(int fd);
>>
>> The session is identified by the file descriptor fd.
>>
>> Currently no other parameters are supported for those calls.
>>
>> e) reading results
>>
>>    int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);
>>
>> The session is identified by the file descriptor fd.
>>
>> Just like for programming the registers, it is possible to pass vectors of structures in pmds. The number of elements is passed in n.
>>
>> f) termination
>>
>>    int close(fd);
>>
>> To terminate a session, the file descriptor has to be closed. The semantics of file descriptor sharing apply, so if another reference to the session, i.e., another file descriptor, exists, the session will only be effectively destroyed once that reference disappears.
>>
>> Of course, the kernel closes all file descriptors on process termination, thus the associated sessions will eventually be destroyed.
>>
>> In per-CPU mode, it is not necessary, though recommended, to be running on the monitored CPU to issue this call.
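Finally, to tie the session breakdown of section 1 to the calls above, here is an untested end-to-end sketch of a self-monitoring per-thread session. The header name, register numbers and event encoding are illustrative placeholders, gettid() is obtained via syscall() since glibc may not wrap it, and error handling is omitted:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <perfmon/perfmon.h>    /* illustrative header name */

    int main(void)
    {
            struct pfarg_ctx  ctx;
            struct pfarg_pmc  pmc;
            struct pfarg_pmd  pmd;
            struct pfarg_load load;
            int fd;

            /* 1. create the session */
            memset(&ctx, 0, sizeof(ctx));
            fd = pfm_create_session(&ctx, NULL, NULL, 0);

            /* 2. program one config and one data register (placeholders) */
            memset(&pmc, 0, sizeof(pmc));
            memset(&pmd, 0, sizeof(pmd));
            pmc.reg_num   = 0;
            pmc.reg_value = 0x1234;         /* PMU-specific event encoding */
            pmd.reg_num   = 0;
            pmd.reg_value = 0;              /* start counting from zero */
            pfm_write_pmcs(fd, &pmc, 1);
            pfm_write_pmds(fd, &pmd, 1);

            /* 3. attach to the calling thread */
            memset(&load, 0, sizeof(load));
            load.target = syscall(SYS_gettid);   /* kernel thread id */
            pfm_load_context(fd, &load);

            /* 4. start, run the measured code, stop */
            pfm_start(fd);
            /* ... code being measured ... */
            pfm_stop(fd);

            /* 5. read the result */
            pfm_read_pmds(fd, &pmd, 1);
            printf("count: %llu\n", (unsigned long long)pmd.reg_value);

            /* 6. detach and terminate */
            pfm_unload_context(fd);
            close(fd);
            return 0;
    }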