Klaus:

Thanks for the analysis and debugging.  Unfortunately I am in the middle of several big projects and cannot take time to examine this in the kind of depth that it will require.

Here are some thoughts and observations.

I too have noticed that strtst runs slowly.  I have been attributing it to the lengthy delays inserted to allow the syslog process to catch up.  Try running "strtst -d0" and see if it runs faster.

I found when running with -d0 that some tests failed on a 4-CPU system that I was using.  I attributed this to races that were not present with the debugging turned on.  So I inserted some delays into strtst to ensure that a message sent on one fid had time to traverse the loop driver to the other before being read.  These delays may stretch out the running time of strtst as well.

You are correct about the lis_in_syscall variable.  It no longer seems useful.  Some day I will probably take it out.

I went to the runq-process-only scheme in order to ensure that service procedures would always be run at task level, never at interrupt level or at BH level.  The newer kernels are very restrictive about what you can and cannot do from interrupt level.  For example, if you always have to allocate memory as "atomic," it means that if the kernel needs to mess with its page tables in order to satisfy your request, the request will fail.  The same request made at "kernel" level would succeed, because it could afford to wait for the page-table shuffling.

It is also why I went to semaphores instead of spin locks to protect queue structures.  That way service procedures are entered with LiS holding no spin locks.  This allows the service procedure to call kernel routines that "might sleep".  Obviously you don't want to do anything in a service procedure that would sleep for very long, but calling the memory allocator is pretty innocent.

The LiS queue run processes are supposed to run at real-time priority, a priority band above any other process.  So a wakeup time of 220 ms is huge in that context.  I have done some independent timing of thread wake-up times in user space using pthreads and have found times on the order of 20 microseconds on a fast machine.

If we were to put a struct timeval into the semaphore structure then, under a debug option, we could use lis_gettimeofday() to timestamp the semaphore at "up" time and then compute the latency in the "down" routine.  We could keep a global histogram of these latencies and report them back via an ioctl.  That would give an idea of what the modal wakeup time is.  Perhaps the 0.22 sec is a one-of-a-kind event, or nearly so.

That's about all I can do at this point.

-- Dave

At 02:30 AM 7/29/2004, Golbach, Klaus wrote:
This is the second try; the last mail was too long.

Hello LiS community,
I tested LiS-2.17-R on Intel (x86) single- and two-processor machines.
The Linux version is SuSE SLES8.
I ran strtst, and (depending on the amount of debug logging, the type of
machine, and other conditions that I do not know) the test sometimes
ran well, but mostly failed in one of the "band_tests".
On the two-processor machine it was sometimes not possible to unload
streams, and I had to reboot the machine.

Looking at the logging in messages, I had the feeling that the
appropriate service routines were not being called in time.
I changed the last option asked in the Configure phase to "n" and added
some debugging instructions to check whether this is correct. (You can
find the extensions as a diff at the end.)

Debugging this way is not nice, because:
- Adding or removing printk instructions changes the behaviour in
either direction, worse or better.
- You can never be sure that some important entries have not been lost.
So I cannot present "reasonable traces" of all the errors that
happened, and I can reproduce the same error only with some luck.

In the appended traces, which are reasonably easy to read, I have also
added remarks to draw your attention to the points that seemed
important to me.
The result:
The message sent down is not processed in time. In general it takes a
long time until lis_thread_runqueues is woken up.
In the worst case it took up to 0.22 seconds from lis_up to the
scheduling of lis_thread_runqueues!
I guess something is wrong with the priority of this LiS-RunQ process,
but I don't know how to verify that or make it better.

I grepped for the keyword mregparm in all files under /usr/src/ to
rule out the mentioned problems with parameter passing. I found it only
in comments.

So please, can you give me some proposals on how to proceed with this
problem? Maybe you know more about it.
Other proposals are also welcome, because I am new to Linux.

I also tested with other LiS versions:
LiS-2.16:
No troubles so far. There lis_run_queues is mostly called directly, so
there is no delay.
But we have to use LiS-2.17/2.18, because we need the per-driver
locking option in LiS.
LiS-2.17-H:
The same problems as in LiS-2.17-R. In addition, the two-processor
machine panics. Also some minor problems.

I changed lis_setqsched in the LiS-2.17* versions (see the CALL_DIRECT
switch in linux-mdep.c) so that lis_run_queues is called directly in
most cases. strtst did not fail.
By the way: strtst takes 161 seconds the old way and 95 seconds with
the direct call.

This is my emergency exit (though I do not feel good about it):
changing to the direct call changes the behaviour and can serve as a
workaround for a while, until we go into serious testing.
But it is not the real solution. It will not work in interrupt
context, where lis_in_syscall==1 and can_call==0.
In addition, Dave has moved away from the direct-call strategy, and I
do not know his reasons, nor whether the flags lis_in_syscall and
can_call will be set correctly in this and future versions of LiS,
since they are no longer necessary for his new strategy in
lis_setqsched.
What do you think?

best regards
   Klaus Golbach
