Hi Stephane, You might want to add a separate section that details the thinking about why you don't want to use a single, multiplexed syscall. If you add this, it could go after you've detailed the session breakdown, and before you've described the current syscalls. I know this has been an area some of the LKML folks have picked at before.
I will read this in more detail later.

- Corey

[EMAIL PROTECTED] wrote on 07/01/2008 09:41:36 AM:

> Hello everyone,
>
> I intend to send the following description to LKML and a few LKML
> developers to try to explain the reasoning behind the current syscall
> interface for perfmon2.
>
> I know there have been a lot of doubts and misunderstandings as to why
> we need so many syscalls and how they could be extended. I have tried
> to address those concerns here.
>
> Please feel free to comment or add to it.
>
> Thanks.
>
> -------------------------------------------------------------------------
>
> 1) monitoring session breakdown
>
> A monitoring session can be decomposed into a sequence of fundamental
> actions, which are as follows:
>   - create the session
>   - program registers
>   - attach to target thread or CPU
>   - start monitoring
>   - stop monitoring
>   - read results
>   - detach from thread or CPU
>   - terminate session
>
> The order need not be exactly as shown. For instance, the programming
> may happen after the session has been attached. Obviously, the
> start/stop operations may be repeated before results are read, and
> results can be read multiple times.
>
> In the next sections, we examine each action separately.
>
> 2) session creation
>
> Perfmon2 supports 2 types of sessions: per-thread or per-CPU
> (so-called system-wide).
>
> During the creation of the session, certain attributes are set; they
> remain until the session is terminated. For instance, the per-CPU
> attribute cannot be changed.
>
> During creation, the kernel state to support the session is allocated
> and initialized. No PMU hardware is actually accessed. Permissions to
> create a session may be checked. Resource limits are also validated and
> memory consumption is accounted for.
>
> The software state of the PMU is initialized, i.e., all configuration
> registers are set to a quiescent value.
> Data registers are initialized to zero whenever possible.
>
> Upon return, the kernel returns a unique identifier which is to be used
> for all subsequent actions on the session.
>
> 3) programming the registers
>
> Programming of the PMU registers can occur at any time during the
> lifetime of a session; the session does not need to be attached to a
> thread or CPU.
>
> It may be necessary to change the settings, e.g., to monitor another
> event or reset the counts when sampling at the user level. Thus, the
> writing of the registers MUST be decoupled from the creation of the
> session.
>
> Similarly, writing of configuration and data registers must also be
> decoupled, as data registers may be reprogrammed independently of their
> configuration registers, for instance when sampling.
>
> The number of registers varies a lot from one PMU to another. The
> relationships between configuration and data registers can be more
> complex than just one-to-one. On most PMUs, writing the PMU registers
> requires running at the most privileged level, i.e., in the kernel. To
> amortize the cost of a system call, it is useful to be able to program
> multiple registers in one call. Thus, it must be possible to pass
> vector arguments. Of course, for security reasons, the system
> administrator may impose a limit on how big vectors can actually be.
> The advantage is that vectors can vary in size, and thus the amount of
> data passed between application and kernel can be kept to the minimum
> needed. System call data needs to be copied into the kernel memory
> space before it can be used.
>
> 4) attachment and detachment
>
> A session can be attached to a kernel-visible thread or a CPU. If there
> is attachment, then it must be possible to detach the session to
> possibly re-attach it to another thread or CPU. Detachment should not
> require destroying the session.
>
> There are 3 possibilities for attachment:
>   - when the session is created
>   - when the monitoring is activated
>   - with a dedicated call
>
> If the attachment is done during the creation of the session, then the
> target (thread or CPU) needs to exist at that time. For a CPU-wide
> session, this means that the session must be created while executing on
> that CPU. This does not seem unreasonable, especially on NUMA systems.
>
> For a per-thread session, however, this is more problematic, as it
> means it is not possible to prepare the session and the PMU registers
> before the thread exists. When monitoring across fork and
> pthread_create, it is important to minimize overhead. Creation of a
> session can trigger complex memory allocations in the kernel. Thus, it
> may be useful to prepare a batch of ready-to-go sessions, which just
> need to be attached when the fork or pthread_create notification
> arrives.
>
> If the attachment is coupled with the creation of the session, it
> implies that the detachment is coupled with its destruction, by
> symmetry. Coupling detachment with termination is problematic for both
> per-thread and CPU-wide modes. With the former, the termination of a
> thread is usually totally asynchronous with the termination of the
> session by the monitoring tool. The only case where they are
> synchronized is for self-monitored threads. When a tool is monitoring a
> thread in another process, the termination of that thread will cause
> the kernel to detach the session. But the session must not be closed,
> because the tool likely wants to read the results and also because the
> session still exists for the tool. For CPU-wide, there is also an issue
> when a monitored CPU is put off-line dynamically. The session would be
> detached by the kernel, yet the session would still be live in the
> tool, whose controlling thread would have been migrated off of that
> CPU.
>
> If the attachment is done when monitoring is activated, then the
> detachment is done when monitoring is deactivated. The following
> relationships are therefore enforced:
>
>   attached => activated
>   stopped  => detached
>
> It is expected that start/stop operations could be very frequent for
> self-monitored workloads. When used to monitor small sections of
> critical code, e.g., loop kernels, it is important to minimize
> overhead, thus the start/stop should be as simple as possible.
>
> Attaching requires loading the PMU machine state onto the PMU hardware.
> Conversely, detaching implies flushing the PMU state to memory so
> results can be read even after the termination of a thread, for
> instance. Both operations are expensive due to the high cost of
> accessing the PMU registers.
>
> Furthermore, there are certain PMU models, e.g., Intel Itanium, where
> it is possible to let user level code start/stop monitoring with a
> single instruction. To minimize overhead, it is very important to allow
> this mechanism for self-monitored programs. Yet the session would have
> to be attached/detached somehow. With dedicated attach/detach calls,
> this can be supported transparently. One possible work-around with the
> coupled calls would be to require a system call to attach the session
> and do the initial activation; subsequent start/stop could use the
> lightweight instruction. The session would be stopped and detached with
> a system call.
>
> The dedicated attach/detach calls offer a maximum level of flexibility.
> They let applications create sessions in advance or on-demand. The
> actions on the session, start/stop and attach/detach, are perfectly
> symmetrical. The termination of the monitored target can cause its
> detachment, but the session remains accessible. Issuing the detach
> call on a session already detached by the kernel is harmless.
>
> The cost of start/stop is not impacted.
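
As a rough illustration of the dedicated attach/detach model argued for
above, here is a minimal C sketch of the session state transitions. It is
not perfmon2 code: the demo_* names and the struct are hypothetical, and
the rules that start fails on a detached session and that attach fails on
an already attached one are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-visible state of one monitoring session. */
struct demo_session {
    bool attached;   /* bound to a thread or CPU */
    bool running;    /* monitoring currently active */
    int  target;     /* thread id or CPU number (illustrative only) */
};

/* Attach: the session always comes up with monitoring stopped. */
int demo_attach(struct demo_session *s, int target)
{
    if (s->attached)
        return -1;          /* assumption: must detach before re-attaching */
    s->attached = true;
    s->running = false;
    s->target = target;
    return 0;
}

/* Start stays cheap: it only flips the running state. */
int demo_start(struct demo_session *s)
{
    if (!s->attached)
        return -1;          /* assumption: nothing to monitor yet */
    s->running = true;
    return 0;
}

int demo_stop(struct demo_session *s)
{
    s->running = false;
    return 0;
}

/* Detach: stops monitoring first; harmless if already detached. */
int demo_detach(struct demo_session *s)
{
    s->running = false;
    s->attached = false;
    return 0;
}

/* Target exit: the kernel detaches, but the session object survives. */
void demo_target_exit(struct demo_session *s)
{
    demo_detach(s);
}
```

In this model a tool can call demo_detach() after demo_target_exit() has
already run without getting an error, and can then re-attach the same
session to a new target without recreating it.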
>
> The following properties are enforced:
>   upon attachment   => monitoring stopped
>   during detachment => monitoring stopped
>
> 5) start and stop
>
> It must be possible for an application to start and stop monitoring at
> will and at any moment. Start and stop can be called very frequently
> and not just at the beginning and end of a session. This is especially
> likely for self-monitored threads, where it is customary to monitor
> execution of only one function or loop. Thus those operations can be on
> the critical path and they must therefore be as lightweight as
> possible. See the discussion in the section about attachment and
> detachment.
>
> 6) reading the results
>
> The results are extracted by reading the PMU registers containing data
> (as opposed to configuration). The number of registers of interest can
> vary based on the PMU model, the type of measurement, and the events
> measured.
>
> Reading can occur at regular intervals, e.g., time-based user level
> sampling, and can therefore be on the critical path. Thus it must be as
> lightweight as possible. Given that the cost is dominated by the
> latency of accessing the PMU registers, it is important to only read
> the registers that are used. Thus, the call must provide vector
> arguments just like the calls to program the PMU.
>
> It must be possible to read the registers while the session is detached
> but also when it is attached to a thread or CPU.
>
> 7) termination
>
> Termination of a session means all the associated resources are either
> released to the free pool or destroyed. After termination, no state
> remains. Termination implies stopping monitoring and detaching the
> session if necessary.
>
> For the purpose of termination, one has to differentiate between the
> monitored entity and the controlling entity. When a tool monitors a
> thread in another process, all the threads from the tool are
> controlling entities, and the monitored thread is the monitored entity.
> Any entity can vanish at any time.
>
> If the monitored entity terminates voluntarily, i.e., normal exit, or
> involuntarily, e.g., core dump, the kernel simply detaches the session
> but does not destroy it.
>
> Until the last controlling entity disappears, the session remains
> accessible.
>
> There are situations where all the controlling entities disappear
> before the monitored entity. In this case, the session becomes useless,
> as results cannot be extracted, thus the session enters the zombie
> state. It will eventually be detached and its resources will be
> reclaimed by the kernel, i.e., the session will be terminated.
>
> 8) extensibility
>
> There is already a vast diversity among existing PMU models. This is
> unlikely to change; quite to the contrary, it is envisioned that the
> PMU will become a true value-add and that vendors will therefore try to
> differentiate themselves from one another. Moreover, the PMU will
> remain closely tied to the underlying micro-architecture. Therefore, it
> is very important to ensure that the monitoring interface will be able
> to adapt easily to future PMU models and their extended features, i.e.,
> what is offered beyond counting events.
>
> It is important to realize that extensibility is not limited to
> supporting more PMU registers. It also includes supporting advanced
> sampling features or socket-level PMUs as opposed to just core-level
> PMUs.
>
> It may be necessary to extend the system calls with new generic or
> architecture-specific parameters, and to do so without simply adding
> new system calls.
>
> 9) current perfmon2 interface
>
> The perfmon2 interface design is guided by the principles described in
> the previous sections. We now explain each call in detail.
>
> a) session creation
>
>    int pfm_create_session(struct pfarg_ctx *ctx, char *smpl_name,
>                           void *smpl_arg, size_t arg_size);
>
>    The function creates the perfmon session and returns a file
>    descriptor used to manipulate the session thereafter.
>
>    The call takes several parameters, which are as follows:
>      - ctx: pointer to a pfarg_ctx structure, which encapsulates all
>        session parameters (see below)
>      - smpl_name: used when sampling to designate which format to use
>      - smpl_arg: pointer to format-specific arguments
>      - arg_size: size of the structure passed in smpl_arg
>
>    The pfarg_ctx structure is defined as follows:
>      - flags: generic and arch-specific flags for the session
>      - reserved: reserved for future extensions
>
>    To provide for future extensions, the pfarg_ctx structure contains
>    reserved fields. Reserved fields must be zeroed.
>
>    To create a per-CPU session, the value PFM_CTX_SYSTEM_WIDE must be
>    passed in flags.
>
>    When in-kernel sampling is not used, smpl_name, smpl_arg, and
>    arg_size must be 0.
>
> b) programming the registers
>
>    int pfm_write_pmcs(int fd, struct pfarg_pmc *pmcs, int n);
>    int pfm_write_pmds(int fd, struct pfarg_pmd *pmds, int n);
>
>    The calls are provided to program the configuration and data
>    registers respectively. The parameters are as follows:
>      - fd: file descriptor identifying the session
>      - pmcs: pointer to pfarg_pmc structures
>      - pmds: pointer to pfarg_pmd structures
>      - n: number of elements in the pmcs or pmds vector
>
>    It is possible to pass a vector of pfarg_pmc or pfarg_pmd
>    structures. The minimal size is 1; the maximum size is determined by
>    the system administrator.
>
>    The pfarg_pmc structure is defined as follows:
>      struct pfarg_pmc {
>          u16 reg_num;
>          u64 reg_value;
>          u64 reserved[];
>      };
>
>    The pfarg_pmd structure is defined as follows:
>      struct pfarg_pmd {
>          u16 reg_num;
>          u64 reg_value;
>          u64 reserved[];
>      };
>
>    Although both structures are currently identical, they will differ
>    as more functionality is added, so it is better to create two
>    versions from the start.
>
>    Provisions for extensions are provided by the reserved field in each
>    structure.
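
To make the vector-argument point concrete, here is a minimal
self-contained C sketch of what the receiving side of a pfm_write_pmcs()
style call might do with the vector. It is only a model, not the
perfmon2 implementation: demo_write_pmcs(), struct demo_regs, the
DEMO_* limits, and the size of the reserved array are all hypothetical,
and the check that reserved fields are zero is an assumption carried
over from the pfarg_ctx rule.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_PMCS   8   /* hypothetical number of config registers */
#define DEMO_MAX_VECTOR 4   /* hypothetical administrator-imposed limit */
#define DEMO_RESERVED   2   /* assumed size; the text leaves it open */

/* Same fields as in the text, with an assumed reserved size. */
struct pfarg_pmc {
    uint16_t reg_num;
    uint64_t reg_value;
    uint64_t reserved[DEMO_RESERVED];
};

/* Stand-in for the per-session software copy of the registers. */
struct demo_regs {
    uint64_t pmc[DEMO_NUM_PMCS];
};

/*
 * Validate the whole vector first, then apply it, so that a bad element
 * leaves the session untouched. Returns 0 on success, -1 on error.
 */
int demo_write_pmcs(struct demo_regs *r, const struct pfarg_pmc *pmcs, int n)
{
    static const uint64_t zero[DEMO_RESERVED];
    int i;

    if (n < 1 || n > DEMO_MAX_VECTOR)
        return -1;                      /* vector size outside limits */

    for (i = 0; i < n; i++) {
        if (pmcs[i].reg_num >= DEMO_NUM_PMCS)
            return -1;                  /* no such register */
        if (memcmp(pmcs[i].reserved, zero, sizeof(zero)) != 0)
            return -1;                  /* assumed: reserved must be zero */
    }

    for (i = 0; i < n; i++)
        r->pmc[pmcs[i].reg_num] = pmcs[i].reg_value;
    return 0;
}
```

A caller would zero an array of two pfarg_pmc entries, fill in reg_num
and reg_value for each, and pass n = 2, so both registers are programmed
for the price of one call.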
>
> c) attachment and detachment
>
>    int pfm_load_context(int fd, struct pfarg_load *ld);
>    int pfm_unload_context(int fd);
>
>    The session is identified by the file descriptor, fd.
>
>    To attach, the targeted thread or CPU must be provided. For
>    extensibility purposes, the target is passed in a structure which
>    is defined as follows:
>      struct pfarg_load {
>          u32 target;
>          u64 reserved[];
>      };
>
>    In per-thread mode, the target field must be set to the kernel
>    thread identification (gettid()).
>
>    In per-CPU mode, the target field must be set to the logical CPU
>    identification as seen by the kernel. Furthermore, the caller must
>    be running on the CPU to monitor, otherwise the call fails.
>
>    Extensions can be implemented using the reserved field.
>
> d) start and stop
>
>    int pfm_start(int fd);
>    int pfm_stop(int fd);
>
>    The session is identified by the file descriptor fd.
>
>    Currently no other parameters are supported for those calls.
>
> e) reading results
>
>    int pfm_read_pmds(int fd, struct pfarg_pmd *pmds, int n);
>
>    The session is identified by the file descriptor fd.
>
>    Just like for programming the registers, it is possible to pass
>    vectors of structures in pmds. The number of elements is passed in
>    n.
>
> f) termination
>
>    int close(int fd);
>
>    To terminate a session, the file descriptor has to be closed. The
>    semantics of file descriptor sharing apply, so if another reference
>    to the session, i.e., another file descriptor, exists, the session
>    will only be effectively destroyed once that reference disappears.
>
>    Of course, the kernel does close all file descriptors on process
>    termination, thus the associated sessions will eventually be
>    destroyed.
>
>    In per-CPU mode, it is not necessary, though recommended, to be on
>    the monitored CPU to issue this call.
>
> -------------------------------------------------------------------------
> Sponsored by: SourceForge.net Community Choice Awards: VOTE NOW!
> Studies have shown that voting for your favorite open source project,
> along with a healthy diet, reduces your potential for chronic lameness
> and boredom. Vote Now at http://www.sourceforge.net/community/cca08
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel