Linux-Development-Sys Digest #251, Volume #8 Wed, 1 Nov 00 20:13:08 EST
Contents:
Supplying buffers to a device driver (Jeff Andre)
Disk Replication (bootable) (Tom J)
IDE performance measurements (Ravi Wijayaratne)
Re: udp bind and sendto problem (Rick Ellis)
Re: RFI: Linux implementation of tnf tracing system (Kaelin Colclasure)
Re: Supplying buffers to a device driver (Pete Zaitcev)
Re: microsecond-resolution timers (David Wragg)
RFI: Linux port of tnf tracing system ("Kaelin Colclasure")
----------------------------------------------------------------------------
From: [EMAIL PROTECTED] (Jeff Andre)
Subject: Supplying buffers to a device driver
Date: 1 Nov 2000 17:28:49 GMT
I'm writing a device driver for a data acquisition card. The card
has multiple channels and I'd like to make sure every channel has
a buffer available.
A Windows developer showed me the ReadFileEx() call. It allows the
caller to specify a read with an I/O completion routine that's called
when the read completes. I liked the concept and thought I'd try
something similar for Linux.
What I'd like to do is supply a channel with a number of buffers.
When a buffer becomes full, set the card to use the next buffer and
tell the application that the first buffer is full. The card has
scatter/gather capabilities, so it can write directly into user
space (after the buffers have been prepared with the kiobuf
facility). Using this eliminates double buffering.
When a buffer is full, the driver would add it to a queue and raise
a SIGIO/SIGUSR[1,2] signal. The application would then issue an
ioctl() to see which buffer is full and process it.
The question I have is how to supply the buffers to the device driver.
The read() call could be used, but it seems wrong to lie to the
application by taking its buffer and holding on to it/changing it
after the call, no matter what error code is returned. Using an
ioctl() call to supply the buffer seems crude.
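To make the question concrete, here is roughly what I have in mind for
the ioctl variant. Everything below (device name, struct, ioctl
numbers) is made up for illustration, not an existing API:
---8<---
/* Sketch: hand a buffer to the driver with one ioctl, learn which
 * buffer filled with another after SIGIO. All names are invented. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct daq_buf {
        void   *addr;           /* user buffer; driver pins it via kiobufs */
        size_t  len;
};

#define DAQ_QUEUE_BUF _IOW('d', 1, struct daq_buf)  /* supply a buffer */
#define DAQ_GET_FULL  _IOR('d', 2, int)             /* which one filled? */

static void on_sigio(int sig) { /* just interrupts pause() */ }

int main(void)
{
        int fd = open("/dev/daq0", O_RDWR);
        struct daq_buf b = { malloc(65536), 65536 };
        int full;

        if (fd < 0)
                return 1;
        signal(SIGIO, on_sigio);
        fcntl(fd, F_SETOWN, getpid());       /* deliver SIGIO to us */
        fcntl(fd, F_SETFL, O_ASYNC);
        ioctl(fd, DAQ_QUEUE_BUF, &b);        /* card DMAs into b.addr */
        pause();                             /* wait for a SIGIO */
        ioctl(fd, DAQ_GET_FULL, &full);
        printf("buffer %d is full\n", full);
        return 0;
}
--->8---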
Any thoughts or comments?
Thanks,
Jeff Andre
------------------------------
From: [EMAIL PROTECTED] (Tom J)
Subject: Disk Replication (bootable)
Date: Wed, 1 Nov 2000 17:52:35 GMT
Hello. How do you replicate bootable Linux media, such as distribution
kits or just a bootable hard drive? I have been ransacking the HOWTOs,
man pages, and LILO documentation, but I need confirmation.
Under LynxOS (TM), if memory serves, I could use diskcopy to copy
the drive block by block, and makeboot to make the boot sector.
The drives were identical removable media.
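(For the raw copy itself, dd(1) seems to be the Linux analogue:
"dd if=/dev/hda of=/dev/hdb" copies everything, boot sector included,
when the drives are identical. A diskcopy-style C equivalent, just as
a sketch:)
---8<---
/* Minimal diskcopy-style whole-device copier (sketch). Copies every
 * block, boot sector included. Assumes the drives are identical. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        static char buf[64 * 1024];
        ssize_t n;
        int in, out;

        if (argc != 3) {
                fprintf(stderr, "usage: diskcopy /dev/src /dev/dst\n");
                return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY);
        if (in < 0 || out < 0) {
                perror("open");
                return 1;
        }
        while ((n = read(in, buf, sizeof buf)) > 0)
                if (write(out, buf, n) != n) {
                        perror("write");
                        return 1;
                }
        close(in);
        close(out);
        return 0;
}
--->8---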
But I saw another note here suggesting that I may instead have to:
1. partition the drive
2. make the file system
3. use lilo.conf and lilo to install the boot block
4. I guess I just mount the disk and do a recursive copy of everything I want
on there.
Is that about it? Are there instructions in the HOWTOs on making
bootable CD-ROMs?
Thanks
--
Tom J.; tej at world.std.com Massachusetts USA; MSCS; Systems Programmer
Dist. Real-Time Data Acquisition S/W for Science and Eng. under POSIX,
C, C++, X, Motif, Graphics, Audio http://world.std.com/~tej
------------------------------
From: Ravi Wijayaratne <[EMAIL PROTECTED]>
Subject: IDE performance measurements
Date: Wed, 01 Nov 2000 07:22:34 -0600
Reply-To: [EMAIL PROTECTED]
Hi,
I am attempting to get a distribution of IDE disk access latencies vs.
time. I have a simple IDE configuration: one drive and one hwif.
To do this I time stamped every request that goes through the
add_request routine (accounting for request coalescing in make_request
as well). In the hwif->drive structure I added a data structure to
store the measured variables, and created an IOCTL for ide to get the
measured parameters to user space.
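The stamping side amounts to the following ("time_stamp" is a field I
added to struct request for this experiment; it is not a stock field):
============= ox ==============================
/* In add_request() (drivers/block/ll_rw_blk.c), stamp each request
   as it is inserted into the queue: */
	rq->time_stamp = jiffies;	/* queue-insertion time, in jiffies */

/* One way to account for the coalescing mentioned above: when
   make_request() merges a buffer head into an existing request, leave
   the existing stamp alone, so a merged transfer is charged to its
   oldest request. */
===================== ox =========================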
My question is this.
I measure the request processing latencies in ide_end_request. But the
byte counts I get are always less than the number of bytes read (only
reads) by the application, and the total time is much less than the
application time. This suggests that I am not capturing all the bytes
that go through the storage.
* Does request->nr_sectors contain the number of bytes for that
request? Does it get updated somewhere?
* Is there any other path a request takes besides ide_end_request (for
disk I/O)?
Some help would be greatly appreciated.
I measure the I/O latencies in ide_end_request as follows.
============= ox ==============================
void ide_end_request(byte uptodate, ide_hwgroup_t *hwgroup)
{
	struct request *rq;
	unsigned long flags;

	spin_lock_irqsave(&io_request_lock, flags);
	rq = hwgroup->rq;

	if (!end_that_request_first(rq, uptodate, hwgroup->drive->name)) {
		/******** added instrumentation ********/
		time_lapse = jiffies - rq->time_stamp;
		total_bytes = rq->nr_sectors;
		update(hwgroup->drive->performance_structs,
		       time_lapse, total_bytes);
		/***************************************/
		add_blkdev_randomness(MAJOR(rq->rq_dev));
		hwgroup->drive->queue = rq->next;
		blk_dev[MAJOR(rq->rq_dev)].current_request = NULL;
		hwgroup->rq = NULL;
		end_that_request_last(rq);
	}
	spin_unlock_irqrestore(&io_request_lock, flags);
}
===================== ox =========================
total_bytes accumulates the total number of bytes that went to and
came back from storage. The cumulative figure is maintained in
drive->performance_structs.
------------------------------
From: [EMAIL PROTECTED] (Rick Ellis)
Subject: Re: udp bind and sendto problem
Date: 1 Nov 2000 22:23:20 GMT
In article <8tf3a7$unh$[EMAIL PROTECTED]>, <[EMAIL PROTECTED]> wrote:
>sorry, I forgot to include that "little detail".
>errno is set to 22, which is EINVAL, I'm pretty sure.
Ok, what happens if you leave out the bind?
--
http://www.spinics.net/linux
------------------------------
From: Kaelin Colclasure <[EMAIL PROTECTED]>
Subject: Re: RFI: Linux implementation of tnf tracing system
Date: Wed, 01 Nov 2000 16:51:12 -0800
Andi Kleen wrote:
>
> Kaelin Colclasure <[EMAIL PROTECTED]> writes:
> >
> > 1) Does something like this already exist for Linux?
> >
> > My primary motivation for contemplating this is that I need the
> > capability under Linux. The excuse/opportunity to delve into the
> > arcane mysteries of kernel hacking is attractive -- but I do have
> > a lot of other demands on my time right now...
>
> There are at least three such facilities already released for Linux (and in
> addition undoubtedly a few more private ones; I've written at least another
> one for the kernel): The IBM dprobes [unfortunately their implementation
> of user space probes has a few races left, but the kernel version works
> nicely -- can be found somewhere on oss.software.ibm.com], SGI's ktrace
> [minimal but useful multi CPU kernel tracer -- somewhere on oss.sgi.com],
> the Linux Trace Toolkit [kernel tracer with GUI frontend, no URL sorry]
Thanks for the excellent pointers! I've browsed to each of the three
implementations you mentioned (LTT is at http://www.opersys.com/LTT/).
Here is my off-the-cuff assessment of what I saw:
SGI's ktrace was the first I looked at, and is definitely the most
minimalistic facility (not necessarily a bad thing). It only supports
tracing kernel-space code and its structure for an "event" is basically
four words: timestamp, event-code, arg1 and arg2. On the plus side, I
can't see how a more performant trace facility could be implemented. It
even uses per-CPU ring buffers with a user-space utility to merge the
trace files back together after the test run -- very cool.
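For flavor, the whole per-CPU arrangement fits in a few lines. This is
my paraphrase of what I saw, not SGI's actual declarations:
---8<---
/* Paraphrase of the ktrace design described above: a four-word event
 * record and one lock-free ring per CPU. Not SGI's actual code. */
struct trace_entry {
        unsigned long timestamp;        /* e.g. TSC low word */
        unsigned long event;            /* event code */
        unsigned long arg1;
        unsigned long arg2;
};

#define RING_ENTRIES 8192               /* per CPU, power of two */
static struct trace_entry ring[NR_CPUS][RING_ENTRIES];
static unsigned int head[NR_CPUS];      /* write cursors, one per CPU */

/* Logging on the current CPU is one store, no locking needed: */
/*   ring[cpu][head[cpu]++ & (RING_ENTRIES - 1)] = *entry;     */
--->8---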
IBM's dProbes were next. This is a really cool facility -- but it's not
quite the same genre of tracing facility that I had in mind. It does
seem like it would be a terrific complement to a TNF facility, though.
dProbes lets you dynamically insert test probes into any
already-compiled executable, including kernel modules. These probes can
examine (and change!) the CPU registers -- you write them in a simple
RPN-like language. This sounds like a hacker's wet dream! :-) I've
downloaded the package and definitely plan to spend some time with it
later.
The Linux Trace Toolkit (LTT), of all the packages, looks the closest to
the spirit of the TNF toolkit. However, it too appears to focus on the
kernel as the most interesting piece of software running on the system.
As an application developer, I reserve the right to think otherwise. ;-)
This facility can tell you a lot about your processes, of course, by
telling you what system calls are made, and how they interact with other
events in the kernel. But in my (admittedly cursory) examination of the
Web pages I did not see any reference to any ability to instrument your
own user-space process. The fact that the trace runs for a predetermined
period of time, rather than the lifetime of a user-space process, also
argues that this is a tool set more tailored towards kernel work. The
GUI analysis tools are really cool, though.
To place these comments in the context of what I'm wanting from TNF, let
me give a similar brief description of the Solaris implementation:
The TNF facility basically lets you sprinkle TNF_PROBE macros all
through your source, whether destined for user- or kernel-space
execution. These probes are essentially dormant until the prex(1)
utility is used to enable them. When a process is run under prex, it
initially stops at a command prompt which allows you to enable the
specific probes you're interested in, or all probes. The probes
themselves can output an arbitrary number of arguments as part of the
trace event data. When combined with the TNF toolkit (distributed
separately), the raw trace information produced can be graphically
analyzed to produce charts of e.g. the latency distribution between two
arbitrary probe events. TNF probes are designed to be left in deployed
applications, so that "problem installations" can readily produce trace
files from production systems to assist in isolating obscure problems.
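For the unfamiliar, probes in source look roughly like this. I am
reconstructing from memory of the TNF_PROBE(3X) man page, so treat the
argument details as approximate:
---8<---
/* Approximate TNF_PROBE usage (from memory -- see TNF_PROBE(3X)). */
#include <tnf/probe.h>

void work(long size, const char *name)
{
        TNF_PROBE_2(work_start, "work main", "",  /* name, keys, detail */
                    tnf_long,   size, size,
                    tnf_string, name, name);
        /* ... the code being timed ... */
        TNF_PROBE_0(work_end, "work main", "");
}
--->8---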
Miscellaneous other observations:
All three of the existing Linux implementations require patching your
kernel -- although to be fair I believe the SGI ktrace could be reworked
as a pure loadable module fairly easily. I need something that can
readily be deployed / installed on an arbitrary running production
system. I also am primarily interested in userland events. My motivation
for a kernel-level facility is simply to get something as close to TNF
(in terms of 1] performance and 2] functionality) on Linux as possible.
(Okay and, I confess, to play around in the kernel a bit for my own
edification.)
On a development system, it seems like dProbes would be a terrific
complement to a TNF facility. The ability to add TNF probes to an
arbitrary executable (which you may not have source for, for instance)
would *really* be sweet.
> I would suggest using one of these as a base. dprobes looks most promising
> because it doesn't need any source changes, but it is a bit tiring to
> use currently.
>
> > 2) What is the "right" way to interface clients to the facility?
> > Can we use the same mechanism for user processes and kernel-space
> > clients?
>
> Probably not. In Kernel space you want to write directly into a buffer, but
> that's nasty to do from user space due to locking issues.
>
> > As I understand it, there are three alternatives for exposing an
> > API to userland from a kernel module: 1) make the tracing
> > facility a device driver, 2) hook into the /proc filesystem, or
> > 3) add a new system call to the kernel. I understand that these
> > are not mutually exclusive. What I don't know are the trade-offs
> > between these alternatives. There is a requirement that the (a?)
>
> Device driver with an ioctl is probably the best.
Okay, I've been reading "Linux Device Drivers" (LDD) and I was leaning
that way myself. Good to hear. :-)
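For the archive, my reading so far suggests the skeleton is about this
much code (all names invented, 2.4-style interfaces):
---8<---
/* Sketch of a trace device with one ioctl, 2.4-style. All names
 * are invented; compile as a module against 2.4 headers. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/errno.h>

#define TNF_IOC_ENABLE 1                /* placeholder ioctl number */

static int tnf_major;

static int tnf_ioctl(struct inode *inode, struct file *filp,
                     unsigned int cmd, unsigned long arg)
{
        switch (cmd) {
        case TNF_IOC_ENABLE:
                /* arm the probes, return buffer info, etc. */
                return 0;
        default:
                return -ENOTTY;
        }
}

static struct file_operations tnf_fops = {
        owner:  THIS_MODULE,            /* 2.4 gcc initializer style */
        ioctl:  tnf_ioctl,
};

int init_module(void)
{
        tnf_major = register_chrdev(0, "tnf", &tnf_fops);
        return tnf_major < 0 ? tnf_major : 0;
}

void cleanup_module(void)
{
        unregister_chrdev(tnf_major, "tnf");
}
--->8---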
> Another possibility would be a shared memory segment
> with a ring buffer per CPU that is only used for user space processes
> (kernel needs a separate buffer because it should be accessed from
> interrupts and that needs special locking) and does simple user-space
> locking. Advantage: you can trace without needing relatively costly system
> calls. The shared memory segment could be also maintained in the kernel
> by supplying an mmap operation to the device driver.
Uh, is that as complicated to implement as it sounds? :-) I'm thinking
the buffer-per-CPU thing ktrace does is a really good idea -- but
outside of kernel space how would I tell what CPU I'm running on?
Without a syscall? And remember, I'm only up to Chapter 3 of LDD! :-)
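At least the user-space half of Andi's suggestion looks manageable.
A sketch -- /dev/tnf and the buffer layout are my inventions:
---8<---
/* User side of the mmap'ed trace buffer idea. The device name and
 * layout are invented; the driver's mmap op supplies the pages. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define TNF_BUF_SIZE (4 * 1024 * 1024)

int main(void)
{
        int fd = open("/dev/tnf", O_RDWR);
        unsigned long *buf;

        if (fd < 0)
                return 1;
        buf = mmap(0, TNF_BUF_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return 1;
        buf[0] = 42;    /* a probe would append an event record here */
        munmap(buf, TNF_BUF_SIZE);
        close(fd);
        return 0;
}
--->8---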
> > Should I just forget the idea that the kernel and user TNF_PROBEs are
> > going to be able to share the same implementation? Is there a
> > better alternative that I don't know about?
>
> It is certainly possible, but it would be probably more efficient to not
> do it.
>
> > 3) Is it reasonable to kmalloc() a 4MB or larger trace buffer, or do I
> > need to consider a more sophisticated buffering strategy?
>
> You can vmalloc() such a buffer.
Wow, VM in the kernel! You guys are losing your sheen of studliness...
;-) Thanks for the pointer.
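So at trace start/stop, something like this (function names invented):
---8<---
/* Allocate the 4MB trace buffer only while a trace is active. */
#include <linux/vmalloc.h>
#include <linux/errno.h>

#define TNF_BUF_SIZE (4 * 1024 * 1024)
static char *tnf_buf;

int tnf_trace_start(void)               /* invented name */
{
        tnf_buf = vmalloc(TNF_BUF_SIZE);
        return tnf_buf ? 0 : -ENOMEM;
}

void tnf_trace_stop(void)               /* invented name */
{
        vfree(tnf_buf);
        tnf_buf = NULL;
}
--->8---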
> > 4) What kind of high-resolution timers or performance counters are
> > available from the Linux kernel?
>
> What the CPU offers -- the time stamp counter or even more performance MSRs.
Ahh, now I have lots of obviously relevant example code to crib from.
Ya' gotta' love open source. :-)
> -Andi
-- Kaelin
------------------------------
From: [EMAIL PROTECTED] (Pete Zaitcev)
Subject: Re: Supplying buffers to a device driver
Date: Thu, 02 Nov 2000 01:01:53 GMT
>[...]
> What I'd like to do is supply a channel with a number of buffers.
> When a buffer becomes full, set the card to use the next buffer and
> tell the application that the first buffer is full. The card has
> scatter/gather capabilities, so it can write directly into user
> space (after the buffers have been prepared with the kiobuf
> facility). Using this eliminates double buffering.
First a minor nitpick: I used to think that "double buffering"
is exactly the name of the technique you describe at the
beginning of the paragraph. It has no relation to the number of
times the data are copied.
Secondly, look at the way bttv operates. It allocates a kernel
buffer (segmented into frames), then allows the user to remap it with
mmap(). The VIDIOCSYNC ioctl blocks a user thread until a frame
arrives, syncs caches, and returns to the user. The rest is obvious...
This technique also eliminates copying, unless you need more
frames in flight than were allocated initially.
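Roughly, from memory, the user-side loop is as below -- check
<linux/videodev.h> for the real details:
---8<---
/* bttv-style capture loop (V4L1), from memory. The frame data is
 * read in place from the mmap'ed buffer; nothing is copied. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/videodev.h>

int main(void)
{
        int fd = open("/dev/video0", O_RDWR);
        struct video_mbuf mbuf;
        struct video_mmap vm;
        char *base;
        int frame = 0;

        ioctl(fd, VIDIOCGMBUF, &mbuf);          /* buffer size, frame count */
        base = mmap(0, mbuf.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        for (;;) {
                vm.frame  = frame;
                vm.width  = 320;
                vm.height = 240;
                vm.format = VIDEO_PALETTE_RGB24;
                ioctl(fd, VIDIOCMCAPTURE, &vm); /* start filling this frame */
                ioctl(fd, VIDIOCSYNC, &frame);  /* block until it arrives */
                /* frame data: base + mbuf.offsets[frame] */
                frame = (frame + 1) % mbuf.frames;
        }
}
--->8---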
The bttv does not send you any signals. You have to have a thread
that blocks in VIDIOCSYNC. Once it's back from the ioctl, it can
signal someone else. Normally this is not a problem, as the
scalability of your hardware will kill you before the scalability
of multithreading does.
Using kiobufs racks up hacker points, but they are strictly a 2.4
feature.
--Pete
------------------------------
From: David Wragg <[EMAIL PROTECTED]>
Subject: Re: microsecond-resolution timers
Date: 01 Nov 2000 23:22:33 +0000
Kasper Dupont <[EMAIL PROTECTED]> writes:
> You cannot do that with standard PC hardware.
Actually, the local APIC on every Intel processor since the PPro (and
some of the Pentiums) can do this. It has a timer with this kind of
resolution (the time base is the bus clock).
The local APIC is usually only used on SMP systems, and there were
apparently problems with the using it on some UP motherboards. But
recent kernels have an option to use the local APIC even on UP
systems, so perhaps more recent motherboards have fixed those
problems.
> I don't think any computer is capable of
> handling an IRQ every microsecond.
For the systems with a working local APIC, it would be nice to get rid
of the clock interrupt (or perhaps just use it for calibration
purposes), and use the APIC for triggering all kernel timers. It can
be set to trigger an interrupt after an arbitrary period, avoiding the
limitations of a fixed-frequency clock. But adding this facility
to the kernel would be a lot of work.
> But there
> might exist special hardware that can be
> programmed to send an IRQ in n microseconds.
> Of course as always the process may at any
> time be delayed any amount of time due to
> scheduling. If you don't want any interrupts
> or signals but just want to read the timer,
> then rdtscl() sounds like the best solution
> with standard PC hardware.
Yes. The local APIC has great potential but its general use is
probably some way off.
David Wragg
------------------------------
From: "Kaelin Colclasure" <[EMAIL PROTECTED]>
Subject: RFI: Linux port of tnf tracing system
Date: Tue, 31 Oct 2000 08:52:29 +0800
I am contemplating an attempt to build a clone of the Solaris 2.x tnf
facilities for Linux.
---8<---
Miscellaneous Library Functions                       tracing(3X)

NAME
     tracing - overview of tnf tracing system

DESCRIPTION
     tnf tracing is a set of programs and APIs that can be used
     to present a high-level view of the performance of an
     executable, a library, or part of the kernel. tracing is
     used to analyze a program's performance and identify the
     conditions that produced a bug.
...
--->8---
See the tracing(3X) Solaris man page for an overview of tnf.
The general architecture I have in mind is as follows:
1) A kernel module(?) will be developed to implement the trace
buffer. This module will allocate a fixed-size buffer in
kernel-space when a trace is active. It will provide an API for
both kernel-space clients and user processes to add information
to this trace buffer. It will provide a user process the ability
to read/dump the trace buffer contents in their raw (binary)
form. (A rough sketch of this API surface follows the list.)
2) A C interface to the kernel module will be developed. It will be
compatible with the Solaris TNF_PROBE(3X) macros, so that
programs may portably use TNF between Linux and Solaris. See the
TNF_PROBE(3X) Solaris man page for details.
3) A prex(1) work-alike will be developed to provide the interface
for controlling the trace facility from the command line. See the
prex(1) Solaris man page for details.
4) A tnfdump(1) replacement will be developed. It will process raw
(binary) trace data into an ASCII form suitable for examination
or further analysis. As an extension to the Solaris version, it
will have an option to produce an XML version of the trace
information for easier post-processing.
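To make item 1 concrete, the API surface I am imagining is something
like the following. Every name here is provisional:
---8<---
/* Provisional linux/tnf.h -- nothing here exists yet. */
struct tnf_event;       /* raw probe record; layout to be designed */

int  tnf_trace_start(unsigned long bufsize); /* allocate + arm buffer */
void tnf_trace_stop(void);

/* Kernel-space clients call this directly: */
int  tnf_write(const struct tnf_event *ev);

/* User processes reach the same buffer through the module's device
 * node: probes write via ioctl() (or an mmap'ed window), and a dump
 * utility read()s the raw binary contents. */
--->8---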
That's the concept... But as we all know only too well the devil lives
in the details. And since this is my first attempt at kernel hacking,
don't anybody hold their breath waiting for a 0.1 release. And of
course, I have a number of questions:
1) Does something like this already exist for Linux?
My primary motivation for contemplating this is that I need the
capability under Linux. The excuse/opportunity to delve into the
arcane mysteries of kernel hacking is attractive -- but I do have
a lot of other demands on my time right now...
2) What is the "right" way to interface clients to the facility?
Can we use the same mechanism for user processes and kernel-space
clients?
As I understand it, there are three alternatives for exposing an
API to userland from a kernel module: 1) make the tracing
facility a device driver, 2) hook into the /proc filesystem, or
3) add a new system call to the kernel. I understand that these
are not mutually exclusive. What I don't know are the trade-offs
between these alternatives. There is a requirement that the (a?)
mechanism must be accessible from both kernel- and user-space.
I'm reasonably sure a new system call would fit the bill, but
from what research I've done so far it seems that system calls
can't be implemented in loadable modules. On the other hand, I'd
be very surprised if kernel-space entities have ready access to
the /proc filesystem or entries in the /dev tree.
Should I just forget the idea that the kernel and user TNF_PROBEs are
going to be able to share the same implementation? Is there a
better alternative that I don't know about?
3) Is it reasonable to kmalloc() a 4MB or larger trace buffer, or do I
need to consider a more sophisticated buffering strategy?
Solaris uses a 4MB buffer by default, and I've found it adequate
for most purposes. But is this a safe amount of RAM to be
grabbing in kernel mode? Naturally this would only be allocated
when a trace was active, but I don't want the tnf facility to
have a negative impact on Linux's performance. Note that I'm not
really concerned about machines with inadequate RAM.
4) What kind of high-resolution timers or performance counters are
available from the Linux kernel?
Performance tuning is tnf's forte -- but to do it effectively it
needs to be able to measure the time elapsed during e.g. an
individual system call handled by the kernel. And of course the
call to read the time can't be expensive. Some sort of access to
the Pentium's performance counters would probably be ideal -- but
would make the code i86-specific. In principle I'd like to be
portable, but in practice I personally only run Linux on i86
boxen right now...
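For concreteness, the Pentium time stamp counter is a one-instruction
read; the kernel wraps the same instruction in its rdtscl()/rdtscll()
macros. A user-space sketch:
---8<---
/* Read the Pentium TSC, a 64-bit cycle counter. i86-specific. */
#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
        unsigned long long t0 = rdtsc();
        unsigned long long t1 = rdtsc();

        printf("back-to-back rdtsc: %llu cycles\n", t1 - t0);
        return 0;
}
--->8---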
-- Kaelin
------------------------------
** FOR YOUR REFERENCE **
The service address, to which questions about the list itself and requests
to be added to or deleted from it should be directed, is:
Internet: [EMAIL PROTECTED]
You can send mail to the entire list (and comp.os.linux.development.system) via:
Internet: [EMAIL PROTECTED]
Linux may be obtained via one of these FTP sites:
ftp.funet.fi pub/Linux
tsx-11.mit.edu pub/linux
sunsite.unc.edu pub/Linux
End of Linux-Development-System Digest
******************************