Linux-Development-Sys Digest #251, Volume #8 Wed, 1 Nov 00 20:13:08 EST
Contents:
Supplying buffers to a device driver (Jeff Andre)
Disk Replication (bootable) (Tom J)
IDE performance measurements (Ravi Wijayaratne)
Re: udp bind and sendto problem (Rick Ellis)
Re: RFI: Linux implementation of tnf tracing system (Kaelin Colclasure)
Re: Supplying buffers to a device driver (Pete Zaitcev)
Re: microsecond-resolution timers (David Wragg)
RFI: Linux port of tnf tracing system ("Kaelin Colclasure")
----------------------------------------------------------------------------
From: [EMAIL PROTECTED] (Jeff Andre)
Subject: Supplying buffers to a device driver
Date: 1 Nov 2000 17:28:49 GMT
I'm writing a device driver for a data acquisition card. The card
has multiple channels and I'd like to make sure every channel has
a buffer available.
A Windows developer showed me the ReadFileEx() call. It allows the
caller to specify a read with an I/O completion routine that's called
when the read completes. I liked the concept and thought I'd try
something similar for Linux.
What I'd like to do is supply a channel with a number of buffers.
When a buffer becomes full, set the card to use the next buffer and
tell the application that the first buffer is full. The card has
scatter/gather capabilities, so it can write directly into user
space (after the buffers have been prepared with the kiobuf
facility). Using this eliminates double buffering.
When a buffer is full, the driver would add it to a queue and raise
a SIGIO/SIGUSR[1,2] signal. The application would then issue an
ioctl() to see which buffer is full and process it.
The question I have is how to supply the buffers to the device driver.
The read() call could be used, but it seems wrong to lie to the
application by taking its buffer and holding on to it/changing it
after the call, no matter what error code is returned. Using an
ioctl() call to supply the buffer seems crude.
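To make the question concrete, here is roughly what I have in mind for
the ioctl variant. Everything below (device name, struct, ioctl
numbers) is made up for illustration, not an existing API:
---8<---
/* Sketch: hand a buffer to the driver with one ioctl, learn which
 * buffer filled with another after SIGIO. All names are invented. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct daq_buf {
        void   *addr;           /* user buffer; driver pins it via kiobufs */
        size_t  len;
};

#define DAQ_QUEUE_BUF _IOW('d', 1, struct daq_buf)  /* supply a buffer */
#define DAQ_GET_FULL  _IOR('d', 2, int)             /* which one filled? */

static void on_sigio(int sig) { /* just interrupts pause() */ }

int main(void)
{
        int fd = open("/dev/daq0", O_RDWR);
        struct daq_buf b = { malloc(65536), 65536 };
        int full;

        if (fd < 0)
                return 1;
        signal(SIGIO, on_sigio);
        fcntl(fd, F_SETOWN, getpid());       /* deliver SIGIO to us */
        fcntl(fd, F_SETFL, O_ASYNC);
        ioctl(fd, DAQ_QUEUE_BUF, &b);        /* card DMAs into b.addr */
        pause();                             /* wait for a SIGIO */
        ioctl(fd, DAQ_GET_FULL, &full);
        printf("buffer %d is full\n", full);
        return 0;
}
--->8---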
Any thoughts or comments?
Thanks,
Jeff Andre
------------------------------
From: [EMAIL PROTECTED] (Tom J)
Subject: Disk Replication (bootable)
Date: Wed, 1 Nov 2000 17:52:35 GMT
Hello. How do you replicate bootable Linux media, such as distribution
kits or just a bootable hard drive? I have been ransacking the HOWTOs,
man pages, and LILO documentation, but I need confirmation.
Under LynxOS (TM), if memory serves, I could use diskcopy to copy
the drive block by block, and makeboot to make the boot sector.
The drives were identical removable media.
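(For the raw copy itself, dd(1) seems to be the Linux analogue:
"dd if=/dev/hda of=/dev/hdb" copies everything, boot sector included,
when the drives are identical. A diskcopy-style C equivalent, just as
a sketch:)
---8<---
/* Minimal diskcopy-style whole-device copier (sketch). Copies every
 * block, boot sector included. Assumes the drives are identical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[64 * 1024];
        ssize_t n;
        int in, out;

        if (argc != 3) {
                fprintf(stderr, "usage: diskcopy /dev/src /dev/dst\n");
                return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY);
        if (in < 0 || out < 0) {
                perror("open");
                return 1;
        }
        while ((n = read(in, buf, sizeof buf)) > 0)
                if (write(out, buf, n) != n) {
                        perror("write");
                        return 1;
                }
        close(in);
        close(out);
        return 0;
}
--->8---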
But I saw another note here suggesting that I may instead have to:
1. partition the drive
2. make the file system
3. use lilo.conf and lilo to install the boot block
4. I guess I just mount the disk and do a recursive copy of everything I want
on there.
Is that about it? Are there instructions in the HOWTOs on making
bootable CD-ROMs?
Thanks
--
Tom J.; tej at world.std.com Massachusetts USA; MSCS; Systems Programmer
Dist. Real-Time Data Acquisition S/W for Science and Eng. under POSIX,
C, C++, X, Motif, Graphics, Audio http://world.std.com/~tej
------------------------------
From: Ravi Wijayaratne <[EMAIL PROTECTED]>
Subject: IDE performance measurements
Date: Wed, 01 Nov 2000 07:22:34 -0600
Reply-To: [EMAIL PROTECTED]
Hi,
I am attempting to get a distribution of IDE disk access latencies vs.
time. I have a simple IDE configuration: one drive and one hwif.
To do this I time stamped every request that goes through the
add_request routine (accounting for request coalescing in make_request
as well). In the hwif->drive structure I added a data structure to
store the measured variables, and created an IOCTL for ide to get the
measured parameters to user space.
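The stamping side amounts to the following ("time_stamp" is a field I
added to struct request for this experiment; it is not a stock field):
============= ox ==============================
/* In add_request() (drivers/block/ll_rw_blk.c), stamp each request
   as it is inserted into the queue: */
	rq->time_stamp = jiffies;	/* queue-insertion time, in jiffies */

/* One way to account for the coalescing mentioned above: when
   make_request() merges a buffer head into an existing request, leave
   the existing stamp alone, so a merged transfer is charged to its
   oldest request. */
===================== ox =========================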
My question is this.
I measure the request processing latencies in ide_end_request. But the
byte counts I get are always less than the number of bytes read (only
reads) by the application, and the total time is much less than the
application time. This suggests that I am not capturing all the bytes
that go through the storage.
* Does request->nr_sectors contain the number of bytes for that
request? Does it get updated somewhere?
* Is there any other path a request takes besides ide_end_request (for
disk I/O)?
Some help would be greatly appreciated.
I measure the I/O latencies in ide_end_request as follows.
============= ox ==============================
void ide_end_request(byte uptodate, ide_hwgroup_t *hwgroup)
{
	struct request *rq;
	unsigned long flags;

	spin_lock_irqsave(&io_request_lock, flags);
	rq = hwgroup->rq;

	if (!end_that_request_first(rq, uptodate, hwgroup->drive->name)) {
		/******** added instrumentation ********/
		time_lapse = jiffies - rq->time_stamp;
		total_bytes = rq->nr_sectors;
		update(hwgroup->drive->performance_structs,
		       time_lapse, total_bytes);
		/***************************************/
		add_blkdev_randomness(MAJOR(rq->rq_dev));
		hwgroup->drive->queue = rq->next;
		blk_dev[MAJOR(rq->rq_dev)].current_request = NULL;
		hwgroup->rq = NULL;
		end_that_request_last(rq);
	}
	spin_unlock_irqrestore(&io_request_lock, flags);
}
===================== ox =========================
total_bytes accumulates the total number of bytes that went to and
came back from storage. The cumulative figure is maintained in
drive->performance_structs.
------------------------------
From: [EMAIL PROTECTED] (Rick Ellis)
Subject: Re: udp bind and sendto problem
Date: 1 Nov 2000 22:23:20 GMT
In article <8tf3a7$unh$[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> wrote:
>sorry, I forgot to include that "little detail".
>errno is set to 22, which is EINVAL, I'm pretty sure.
Ok, what happens if you leave out the bind?
--
http://www.spinics.net/linux
------------------------------
From: Kaelin Colclasure <[EMAIL PROTECTED]>
Subject: Re: RFI: Linux implementation of tnf tracing system
Date: Wed, 01 Nov 2000 16:51:12 -0800
Andi Kleen wrote:
>
> Kaelin Colclasure <[EMAIL PROTECTED]> writes:
> >
> > 1) Does something like this already exist for Linux?
> >
> > My primary motivation for contemplating this is that I need the
> > capability under Linux. The excuse/opportunity to delve into the
> > arcane mysteries of kernel hacking is attractive -- but I do have
> > a lot of other demands on my time right now...
>
> There are at least three such facilities already released for Linux (and in
> addition undoubtedly a few more private ones; I've written at least another
> one for the kernel): The IBM dprobes [unfortunately their implementation
> of user space probes has a few races left, but the kernel version works
> nicely -- can be found somewhere on oss.software.ibm.com], SGI's ktrace
> [minimal but useful multi CPU kernel tracer -- somewhere on oss.sgi.com],
> the Linux Trace Toolkit [kernel tracer with GUI frontend, no URL sorry]
Thanks for the excellent pointers! I've browsed to each of the three
implementations you mentioned (LTT is at http://www.opersys.com/LTT/).
Here is my off-the-cuff assessment of what I saw:
SGI's ktrace was the first I looked at, and is definitely the most
minimalistic facility (not necessarily a bad thing). It only supports
tracing kernel-space code and its structure for an "event" is basically
four words: timestamp, event-code, arg1 and arg2. On the plus side, I
can't see how a more performant trace facility could be implemented. It
even uses per-CPU ring buffers with a user-space utility to merge the
trace files back together after the test run -- very cool.
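For flavor, the whole per-CPU arrangement fits in a few lines. This is
my paraphrase of what I saw, not SGI's actual declarations:
---8<---
/* Paraphrase of the ktrace design described above: a four-word event
 * record and one lock-free ring per CPU. Not SGI's actual code. */
struct trace_entry {
        unsigned long timestamp;        /* e.g. TSC low word */
        unsigned long event;            /* event code */
        unsigned long arg1;
        unsigned long arg2;
};

#define RING_ENTRIES 8192               /* per CPU, power of two */
static struct trace_entry ring[NR_CPUS][RING_ENTRIES];
static unsigned int head[NR_CPUS];      /* write cursors, one per CPU */

/* Logging on the current CPU is one store, no locking needed: */
/*   ring[cpu][head[cpu]++ & (RING_ENTRIES - 1)] = *entry;     */
--->8---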
IBM's dProbes were next. This is a really cool facility -- but it's not
quite the same genre of tracing facility that I had in mind. It does
seem like it would be a terrific complement to a TNF facility, though.
dProbes lets you dynamically insert test probes into any
already-compiled executable, including kernel modules. These probes can
examine (and change!) the CPU registers -- you write them in a simple
RPN-like language. This sounds like a hacker's wet dream! :-) I've
downloaded the package and definitely plan to spend some time with it
later.
The Linux Trace Toolkit (LTT), of all the packages, looks the closest to
the spirit of the TNF toolkit. However, it too appears to focus on the
kernel as the most interesting piece of software running on the system.
As an application developer, I reserve the right to think otherwise. ;-)
This facility can tell you a lot about your processes, of course, by
telling you what system calls are made, and how they interact with other
events in the kernel. But in my (admittedly cursory) examination of the
Web pages I did not see any reference to any ability to instrument your
own user-space process. The fact that the trace runs for a predetermined
period of time, rather than the lifetime of a user-space process, also
argues that this is a tool set more tailored towards kernel work. The
GUI analysis tools are really cool, though.
To place these comments in the context of what I'm wanting from TNF, let
me give a similar brief description of the Solaris implementation:
The TNF facility basically lets you sprinkle TNF_PROBE macros all
through your source, whether destined for user- or kernel-space
execution. These probes are essentially dormant until the prex(1)
utility is used to enable them. When a process is run under prex, it
initially stops at a command prompt which allows you to enable the
specific probes you're interested in, or all probes. The probes
themselves can output an arbitrary number of arguments as part of the
trace event data. When combined with the TNF toolkit (distributed
separately), the raw trace information produced can be graphically
analyzed to produce charts of e.g. the latency distribution between two
arbitrary probe events. TNF probes are designed to be left in deployed
applications, so that "problem installations" can readily produce trace
files from production systems to assist in isolating obscure problems.
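For the unfamiliar, probes in source look roughly like this. I am
reconstructing from memory of the TNF_PROBE(3X) man page, so treat the
argument details as approximate:
---8<---
/* Approximate TNF_PROBE usage (from memory -- see TNF_PROBE(3X)). */
#include <tnf/probe.h>

void work(long size, const char *name)
{
        TNF_PROBE_2(work_start, "work main", "",  /* name, keys, detail */
                    tnf_long,   size, size,
                    tnf_string, name, name);
        /* ... the code being timed ... */
        TNF_PROBE_0(work_end, "work main", "");
}
--->8---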
Miscellaneous other observations:
All three of the existing Linux implementations require patching your
kernel -- although to be fair I believe the SGI ktrace could be reworked
as a pure loadable module fairly easily. I need something that can
readily be deployed / installed on an arbitrary running production
system. I also am primarily interested in userland events. My motivation
for a kernel-level facility is simply to get something as close to TNF
(in terms of 1] performance and 2] functionality) on Linux as possible.
(Okay and, I confess, to play around in the kernel a bit for my own
edification.)
On a development system, it seems like dProbes would be a terrific
complement to a TNF facility. The ability to add TNF probes to an
arbitrary executable (which you may not have source for, for instance)
would *really* be sweet.
> I would suggest using one of these as a base. dprobes looks most promising
> because it doesn't need any source changes, but it is a bit tiring to
> use currently.
>
> > 2) What is the "right" way to interface clients to the facility?
> > Can we use the same mechanism for user processes and kernel-space
> > clients?
>
> Probably not. In Kernel space you want to write directly into a buffer, but
> that's nasty to do from user space due to locking issues.
>
> > As I understand it, there are three alternatives for exposing an
> > API to userland from a kernel module: 1) make the tracing
> > facility a device driver, 2) hook into the /proc filesystem, or
> > 3) add a new system call to the kernel. I understand that these
> > are not mutually exclusive. What I don't know are the trade-offs
> > between these alternatives. There is a requirement that the (a?)
>
> Device driver with an ioctl is probably the best.
Okay, I've been reading "Linux Device Drivers" (LDD) and I was leaning
that way myself. Good to hear. :-)
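For the archive, my reading so far suggests the skeleton is about this
much code (all names invented, 2.4-style interfaces):
---8<---
/* Sketch of a trace device with one ioctl, 2.4-style. All names
 * are invented; compile as a module against 2.4 headers. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/errno.h>

#define TNF_IOC_ENABLE 1                /* placeholder ioctl number */

static int tnf_major;

static int tnf_ioctl(struct inode *inode, struct file *filp,
                     unsigned int cmd, unsigned long arg)
{
        switch (cmd) {
        case TNF_IOC_ENABLE:
                /* arm the probes, return buffer info, etc. */
                return 0;
        default:
                return -ENOTTY;
        }
}

static struct file_operations tnf_fops = {
        owner:  THIS_MODULE,            /* 2.4 gcc initializer style */
        ioctl:  tnf_ioctl,
};

int init_module(void)
{
        tnf_major = register_chrdev(0, "tnf", &tnf_fops);
        return tnf_major < 0 ? tnf_major : 0;
}

void cleanup_module(void)
{
        unregister_chrdev(tnf_major, "tnf");
}
--->8---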
> Another possibility would be a shared memory segment
> with a ring buffer per CPU that is only used for user space processes
> (kernel needs a separate buffer because it should be accessed from
> interrupts and that needs special locking) and does simple user-space
> locking. Advantage: you can trace without needing relatively costly system
> calls. The shared memory segment could be also maintained in the kernel
> by supplying an mmap operation to the device driver.
Uh, is that as complicated to implement as it sounds? :-) I'm thinking
the buffer-per-CPU thing ktrace does is a really good idea -- but
outside of kernel space how would I tell what CPU I'm running on?
Without a syscall? And remember, I'm only up to Chapter 3 of LDD! :-)
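At least the user-space half of Andi's suggestion looks manageable.
A sketch -- /dev/tnf and the buffer layout are my inventions:
---8<---
/* User side of the mmap'ed trace buffer idea. The device name and
 * layout are invented; the driver's mmap op supplies the pages. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define TNF_BUF_SIZE (4 * 1024 * 1024)

int main(void)
{
        int fd = open("/dev/tnf", O_RDWR);
        unsigned long *buf;

        if (fd < 0)
                return 1;
        buf = mmap(0, TNF_BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;
        buf[0] = 42;    /* a probe would append an event record here */
        munmap(buf, TNF_BUF_SIZE);
        close(fd);
        return 0;
}
--->8---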
> > Should I just forget the idea that the kernel and user TNF_PROBEs are
> > going to be able to share the same implementation? Is there a
> > better alternative that I don't know about?
>
> It is certainly possible, but it would be probably more efficient to not
> do it.
>
> > 3) Is it reasonable to kmalloc() a 4MB or larger trace buffer, or do I
> > need to consider a more sophisticated buffering strategy?
>
> You can vmalloc() such a buffer.
Wow, VM in the kernel! You guys are losing your sheen of studliness...
;-) Thanks for the pointer.
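So at trace start/stop, something like this (function names invented):
---8<---
/* Allocate the 4MB trace buffer only while a trace is active. */
#include <linux/vmalloc.h>
#include <linux/errno.h>

#define TNF_BUF_SIZE (4 * 1024 * 1024)
static char *tnf_buf;

int tnf_trace_start(void)               /* invented name */
{
        tnf_buf = vmalloc(TNF_BUF_SIZE);
        return tnf_buf ? 0 : -ENOMEM;
}

void tnf_trace_stop(void)               /* invented name */
{
        vfree(tnf_buf);
        tnf_buf = NULL;
}
--->8---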
> > 4) What kind of high-resolution timers or performance counters are
> > available from the Linux kernel?
>
> What the CPU offers -- the time stamp counter or even more performance MSRs.
Ahh, now I have lots of obviously relevant example code to crib from.
Ya' gotta' love open source. :-)
> -Andi
-- Kaelin
------------------------------
From: [EMAIL PROTECTED] (Pete Zaitcev)
Subject: Re: Supplying buffers to a device driver
Date: Thu, 02 Nov 2000 01:01:53 GMT
>[...]
> What I'd like to do is supply a channel with a number of buffers.
> When a buffer becomes full, set the card to use the next buffer and
> tell the application that the first buffer is full. The card has
> scatter/gather capabilities, so it can write directly into user
> space (after the buffers have been prepared with the kiobuf
> facility). Using this eliminates double buffering.
First a minor nitpick: I used to think that "double buffering"
is exactly the name of the technique you describe at the
beginning of the paragraph. It has no relation to the number of
times the data are copied.
Secondly, look at the way bttv operates. It allocates a kernel
buffer (segmented into frames), then allows the user to remap it with
mmap(). The VIDIOCSYNC ioctl blocks a user thread until a frame
arrives, syncs caches, and returns to the user. The rest is obvious...
This technique also eliminates copying, unless you need more
frames in flight than were allocated initially.
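Roughly, from memory, the user-side loop is as below -- check
<linux/videodev.h> for the real details:
---8<---
/* bttv-style capture loop (V4L1), from memory. The frame data is
 * read in place from the mmap'ed buffer; nothing is copied. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev.h>

int main(void)
{
        int fd = open("/dev/video0", O_RDWR);
        struct video_mbuf mbuf;
        struct video_mmap vm;
        char *base;
        int frame = 0;

        ioctl(fd, VIDIOCGMBUF, &mbuf);          /* buffer size, frame count */
        base = mmap(0, mbuf.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        for (;;) {
                vm.frame  = frame;
                vm.width  = 320;
                vm.height = 240;
                vm.format = VIDEO_PALETTE_RGB24;
                ioctl(fd, VIDIOCMCAPTURE, &vm); /* start filling this frame */
                ioctl(fd, VIDIOCSYNC, &frame);  /* block until it arrives */
                /* frame data: base + mbuf.offsets[frame] */
                frame = (frame + 1) % mbuf.frames;
        }
}
--->8---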
The bttv does not send you any signals. You have to have a thread
that blocks in VIDIOCSYNC. Once it's back from the ioctl, it can
signal someone else. Normally this is not a problem, as the
scalability of your hardware will kill you before the scalability
of multithreading does.
Using kiobufs racks up hacker points, but they are strictly a 2.4
feature.
--Pete
------------------------------
From: David Wragg <[EMAIL PROTECTED]>
Subject: Re: microsecond-resolution timers
Date: 01 Nov 2000 23:22:33 +0000
Kasper Dupont <[EMAIL PROTECTED]> writes:
> You cannot do that with standard PC hardware.
Actually, the local APIC on every Intel processor since the PPro (and
some of the Pentiums) can do this. It has a timer with this kind of
resolution (the time base is the bus clock).
The local APIC is usually only used on SMP systems, and there were
apparently problems with the using it on some UP motherboards. But
recent kernels have an option to use the local APIC even on UP
systems, so perhaps more recent motherboards have fixed those
problems.
> I don't think any computer is capable of
> handling an IRQ every microsecond.
For the systems with a working local APIC, it would be nice to get rid
of the clock interrupt (or perhaps just use it for calibration
purposes), and use the APIC for triggering all kernel timers. It can
be set to trigger an interrupt after an arbitrary period, avoiding the
limitations of a fixed-frequency clock. But adding this facility
to the kernel would be a lot of work.
> But there
> might exist special hardware that can be
> programmed to send an IRQ in n microseconds.
> Of course as always the process may at any
> time be delayed any amount of time due to
> scheduling. If you don't want any interrupts
> or signals but just want to read the timer,
> then rdtscl() sounds like the best solution
> with standard PC hardware.
Yes. The local APIC has great potential but its general use is
probably some way off.
David Wragg
------------------------------
From: "Kaelin Colclasure" <[EMAIL PROTECTED]>
Subject: RFI: Linux port of tnf tracing system
Date: Tue, 31 Oct 2000 08:52:29 +0800
I am contemplating an attempt to build a clone of the Solaris 2.x tnf
facilities for Linux.
---8<---
Miscellaneous Library Functions                       tracing(3X)

NAME
     tracing - overview of tnf tracing system

DESCRIPTION
     tnf tracing is a set of programs and APIs that can be used
     to present a high-level view of the performance of an
     executable, a library, or part of the kernel. tracing is
     used to analyze a program's performance and identify the
     conditions that produced a bug.
...
--->8---
See the tracing(3X) Solaris man page for an overview of tnf.
The general architecture I have in mind is as follows:
1) A kernel module(?) will be developed to implement the trace
buffer. This module will allocate a fixed-size buffer in
kernel-space when a trace is active. It will provide an API for
both kernel-space clients and user processes to add information
to this trace buffer. It will provide a user process the ability
to read/dump the trace buffer contents in their raw (binary)
form. (A rough sketch of this API surface follows the list.)
2) A C interface to the kernel module will be developed. It will be
compatible with the Solaris TNF_PROBE(3X) macros, so that
programs may portably use TNF between Linux and Solaris. See the
TNF_PROBE(3X) Solaris man page for details.
3) A prex(1) work-alike will be developed to provide the interface
for controlling the trace facility from the command line. See the
prex(1) Solaris man page for details.
4) A tnfdump(1) replacement will be developed. It will process raw
(binary) trace data into an ASCII form suitable for examination
or further analysis. As an extension to the Solaris version, it
will have an option to produce an XML version of the trace
information for easier post-processing.
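To make item 1 concrete, the API surface I am imagining is something
like the following. Every name here is provisional:
---8<---
/* Provisional linux/tnf.h -- nothing here exists yet. */
struct tnf_event;       /* raw probe record; layout to be designed */

int  tnf_trace_start(unsigned long bufsize); /* allocate + arm buffer */
void tnf_trace_stop(void);

/* Kernel-space clients call this directly: */
int  tnf_write(const struct tnf_event *ev);

/* User processes reach the same buffer through the module's device
 * node: probes write via ioctl() (or an mmap'ed window), and a dump
 * utility read()s the raw binary contents. */
--->8---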
That's the concept... But as we all know only too well the devil lives
in the details. And since this is my first attempt at kernel hacking,
don't anybody hold their breath waiting for a 0.1 release. And of
course, I have a number of questions:
1) Does something like this already exist for Linux?
My primary motivation for contemplating this is that I need the
capability under Linux. The excuse/opportunity to delve into the
arcane mysteries of kernel hacking is attractive -- but I do have
a lot of other demands on my time right now...
2) What is the "right" way to interface clients to the facility?
Can we use the same mechanism for user processes and kernel-space
clients?
As I understand it, there are three alternatives for exposing an
API to userland from a kernel module: 1) make the tracing
facility a device driver, 2) hook into the /proc filesystem, or
3) add a new system call to the kernel. I understand that these
are not mutually exclusive. What I don't know are the trade-offs
between these alternatives. There is a requirement that the (a?)
mechanism must be accessible from both kernel- and user-space.
I'm reasonably sure a new system call would fit the bill, but
from what research I've done so far it seems that system calls
can't be implemented in loadable modules. On the other hand, I'd
be very surprised if kernel-space entities have ready access to
the /proc filesystem or entries in the /dev tree.
Should I just forget the idea that the kernel and user TNF_PROBEs are
going to be able to share the same implementation? Is there a
better alternative that I don't know about?
3) Is it reasonable to kmalloc() a 4MB or larger trace buffer, or do I
need to consider a more sophisticated buffering strategy?
Solaris uses a 4MB buffer by default, and I've found it adequate
for most purposes. But is this a safe amount of RAM to be
grabbing in kernel mode? Naturally this would only be allocated
when a trace was active, but I don't want the tnf facility to
have a negative impact on Linux's performance. Note that I'm not
really concerned about machines with inadequate RAM.
4) What kind of high-resolution timers or performance counters are
available from the Linux kernel?
Performance tuning is tnf's forte -- but to do it effectively it
needs to be able to measure the time elapsed during e.g. an
individual system call handled by the kernel. And of course the
call to read the time can't be expensive. Some sort of access to
the Pentium's performance counters would probably be ideal -- but
would make the code i86-specific. In principle I'd like to be
portable, but in practice I personally only run Linux on i86
boxen right now...
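For concreteness, the Pentium time stamp counter is a one-instruction
read; the kernel wraps the same instruction in its rdtscl()/rdtscll()
macros. A user-space sketch:
---8<---
/* Read the Pentium TSC, a 64-bit cycle counter. i86-specific. */
#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        unsigned long long t0 = rdtsc();
        unsigned long long t1 = rdtsc();

        printf("back-to-back rdtsc: %llu cycles\n", t1 - t0);
        return 0;
}
--->8---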
-- Kaelin
------------------------------
** FOR YOUR REFERENCE **
The service address, to which questions about the list itself and requests
to be added to or deleted from it should be directed, is:
Internet: [EMAIL PROTECTED]
You can send mail to the entire list (and comp.os.linux.development.system) via:
Internet: [EMAIL PROTECTED]
Linux may be obtained via one of these FTP sites:
ftp.funet.fi pub/Linux
tsx-11.mit.edu pub/linux
sunsite.unc.edu pub/Linux
End of Linux-Development-System Digest
******************************