Re: Linux's implementation of poll() not scalable?

2000-11-01 Thread Dan Kegel

Mike Jagdis wrote:
> This patch firstly extends the wait queue mechanism
> to allow an arbitrary action to be performed. Then I rewrote
> the select/poll implementation to use event queueing to avoid
> rescanning descriptors that had not changed - and restructured
> the loops to be rather more efficient. This approach doesn't
> need any changes to driver poll routines, it doesn't need
> backwards mapping struct files. ...
>   Performance graphs and the lmbench derived test programs I
> used are at http://www.purplet.demon.co.uk/linux/select/ ...
>   Oh, and I updated this patch for 2.4.0-test9.

I can't wait to run my benchmark on it... hope I can get to it soon.
BTW, can you update that web page to also point to your patch?
- Dan



RE: Linux's implementation of poll() not scalable?

2000-10-30 Thread Mike Jagdis

Here's something I did last year and then put on ice, partly
through lack of time and partly because I thought I'd pick
it up for 2.5.

  All this talk of event queues misses one thing: we already
have an event queue mechanism. They're called wait queues.
The only problem is that the only on-event action possible
is to wake the process (assuming it was asleep in the first
place). This patch firstly extends the wait queue mechanism
to allow an arbitrary action to be performed. Then I rewrote
the select/poll implementation to use event queueing to avoid
rescanning descriptors that had not changed - and restructured
the loops to be rather more efficient. This approach doesn't
need any changes to driver poll routines, it doesn't need
backwards mapping struct files. It should be fairly easy to
implement a /dev/poll mechanism using this, although I haven't
yet.
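
  For a rough idea of the shape of the change (a sketch only, not the
actual patch -- the type and field names here are made up):

    /* Sketch: a wait queue entry that carries an action callback.  The
     * default action would be the usual wake_up of the owning task, so
     * existing users are unaffected. */
    struct wait_queue_action {
            struct wait_queue wait;      /* ordinary wait queue entry   */
            void (*action)(struct wait_queue_action *wqa);  /* on event */
            void *data;                  /* e.g. the poll table slot    */
    };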

  Yes, the change to wait queues has a slight cost, but it isn't
great and the main part of it only happens if you actually sleep.

  Performance graphs and the lmbench derived test programs I
used are at http://www.purplet.demon.co.uk/linux/select/ (bounce
in and out of the index page 'cos the next and prev buttons
aren't wired up :-) )

  Oh, and I updated this patch for 2.4.0-test9.

  Comments and opinions are, as always, welcome :-).

Mike

 select.patch



Readiness vs. completion (was: Re: Linux's implementation of poll() not scalable?)

2000-10-29 Thread Dan Kegel

John Gardiner Myers <[EMAIL PROTECTED]> wrote:
> Your proposed interface suffers from most of the same problems as the 
> other Unix event interfaces I've seen. Key among the problems are 
> inherent race conditions when the interface is used by multithreaded 
> applications. 
> 
> The "stickiness" of the event binding can cause race conditions where 
> two threads simultaneously attempt to handle the same event. For 
> example, consider a socket becomes writeable, delivering a writable 
> event to one of the multiple threads calling get_events(). While the 
> callback for that event is running, it writes to the socket, causing the 
> socket to become non-writable and then writeable again. That in turn 
> can cause another writable event to be delivered to some other thread. 
> ...
> In the async I/O library I work with, this problem is addressed by 
> having delivery of the event atomically remove the binding. If the 
> event needs to be bound after the callback is finished, then it is the 
> responsibility of the callback to rebind the event. 

IMHO you're describing a situation where a 'completion notification event'
(as with aio) would be more appropriate than a 'readiness notification event'
(as with poll).

With completion notification, one naturally expects 'edge triggered',
'one shot' behavior from the notification system, with no event coalescing,
and there is no need to remove or reestablish bindings.

> There are three performance issues that need to be addressed by the 
> implementation of get_events(). One is that events should preferably be 
> given to threads on the same CPU that bound the event. That allows the 
> event's context to remain in the CPU's cache. 
> 
> Two is that of threads on a given CPU, events should wake threads in 
> LIFO order. This allows excess worker threads to be paged out. 
> 
> Three is that the event queue should limit the number of worker threads 
> contending with each other for CPU. If an event is triggered while 
> there are enough worker threads in runnable state, it is better to wait 
> for one of those threads to block before delivering the event. 

That describes NT's 'completion port / thread pooling' scheme, I think
(which, incidentally, is a 'completion notification'-based rather than a
'readiness notification'-based scheme).

I suspect readiness notification using edge triggering is a
strange beast, not often seen in the wild, and hard to define precisely.

I'm going to risk generalizing, and categorizing the existing base
of application software into two groups.  Would it be going too far to say
the following:

  Readiness notification, like that provided by traditional poll(),
  fits naturally with level-triggered events with event coalescing,
  and a large body of traditional Unix software exists that uses this paradigm.

  Completion notification, like that provided by aio and NT's networking,
  fits naturally with edge-triggered events with no event coalescing,
  and a large body of win32 software exists that uses this paradigm.

And, come to think of it, network programmers usually can be categorized
into the same two groups :-)  Each style of programming is an acquired taste.

IMHO if Linux is to be maximally popular with software developers
(desirable if we want to boost the number of apps available for Linux),
it would help to cater to both flavors of network programming.

So I'd like to see both a high-performance level-triggered readiness
notification API with event coalescing, and a high-performance edge-triggered
completion API with no event coalescing.  With luck, they'll be the
same API, but with slightly different flag values.

- Dan



Re: Linux's implementation of poll() not scalable?

2000-10-27 Thread John Gardiner Myers

Linus Torvalds wrote:
> So sticky arrays of events are good, while queues are bad. Let's take that 
> as one of the fundamentals.

Please let's not.  There is nothing fundamentally different between an
event queue of size N and an interest set of size N.

Your proposed interface suffers from most of the same problems as the
other Unix event interfaces I've seen.  Key among the problems are
inherent race conditions when the interface is used by multithreaded
applications.

The "stickiness" of the event binding can cause race conditions where
two threads simultaneously attempt to handle the same event.  For
example, consider a socket becomes writeable, delivering a writable
event to one of the multiple threads calling get_events().  While the
callback for that event is running, it writes to the socket, causing the
socket to become non-writable and then writeable again.  That in turn
can cause another writable event to be delivered to some other thread.

In the async I/O library I work with, this problem is addressed by
having delivery of the event atomically remove the binding.  If the
event needs to be bound after the callback is finished, then it is the
responsibility of the callback to rebind the event.  Typically, a
server implements a command/response protocol, where the read callback 
reads a command, writes the response, and repeats.  It is quite likely
that after the read callback completes, the connection is interested in
a write event, not another read event.
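
As a rough sketch of that pattern (hypothetical names throughout; only
bind_event() is borrowed from the proposal in this thread, and its exact
callback signature is a guess):

    struct conn;                                 /* per-connection state    */
    void read_command(struct conn *c);           /* placeholder helpers     */
    void start_response(struct conn *c);
    static void write_cb(int fd, void *arg);

    /* Delivery already removed the binding, so the callback decides
     * what, if anything, to bind next. */
    static void read_cb(int fd, void *arg)
    {
            struct conn *c = arg;

            read_command(c);                     /* parse one command       */
            start_response(c);                   /* queue the reply         */

            /* Now we care about writability, not readability. */
            bind_event(fd, POLLOUT, c, write_cb);
    }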

Cancellation of event bindings is another area prone to race
conditions.  A thread canceling an event binding frequently needs to
know whether or not that event has been delivered to some other thread. 
If it has, the canceling thread has to synchronize with the other thread
before destroying any resources associated with the event.

Multiple event queues are needed by libraries as different libraries can
have different thread pools.  The threads in those different pools can
have different characteristics, such as stack size, priority, CPU
affinity, signal state, etc.

There are three performance issues that need to be addressed by the
implementation of get_events().  One is that events should preferably be
given to threads on the same CPU that bound the event.  That allows the
event's context to remain in the CPU's cache.

Two is that of threads on a given CPU, events should wake threads in
LIFO order.  This allows excess worker threads to be paged out.

Three is that the event queue should limit the number of worker threads
contending with each other for CPU.  If an event is triggered while
there are enough worker threads in runnable state, it is better to wait
for one of those threads to block before delivering the event.


RE: Linux's implementation of poll() not scalable?

2000-10-27 Thread Chris Swiedler

> It doesn't practically matter how efficient the X server is when
> you aren't busy, after all.

A simple polling scheme (i.e. not using poll() or select(), just looping
through all fd's trying nonblocking reads) is perfectly efficient when the
server is 100% busy, and perfectly inefficient when there is nothing to do.
I'm not saying that your statements are wrong--in your example, X is calling
select() which is not wasting as much time as a hard-polling loop--but it's
wrong to say that high-load efficiency is the primary concern. I would be
horrified if X took a significant portion of the CPU time when many clients
were connected, but none were actually doing anything.
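
For concreteness, the kind of loop I mean (a sketch; fds[], nfds and
handle_data() are placeholders):

    char buf[4096];
    ssize_t n;
    int i;

    for (;;) {                                   /* never blocks */
            for (i = 0; i < nfds; i++) {
                    n = read(fds[i], buf, sizeof(buf));  /* O_NONBLOCK fds */
                    if (n > 0)
                            handle_data(fds[i], buf, n);
                    /* n < 0 with errno == EAGAIN just means "nothing yet" */
            }
    }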

chris



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Jonathan Lemon

In article <[EMAIL PROTECTED]> you write:
>Linus Torvalds wrote:
>> I'd much rather have an event interface that is documented to be edge-
>> triggered and is really _lightweight_, than have another interface that
>> starts out with some piggy features.
>
>Agreed (except for that 'edge-triggered' part), but I don't think
>'level-triggered' implies piggy.   I haven't benchmarked whether
>kqueue() slows down the networking layer of FreeBSD yet; do you
>suspect maintaining the level-triggered structures actually is
>a bottleneck for them?

I really don't think it's a bottleneck.  At the moment, all events
are maintained on a linked list.  To dequeue an event, we simply:

1. take the event on the front of the list.
2. validate event.  (call filter function)
3. copy event into return array.
4. put event back on end of list.

If the `EV_ONESHOT' flag is set, we skip steps 2 & 4, and destroy
the event after it is returned to the user.
(we want to wait only once for this particular event)

If the `EV_CLEAR' flag is set, we skip step 4.
(pure edge-triggered delivery)

Step 4 is pretty simple, just re-insertion back onto the queue.

If you eliminate Step 2, then you have a `correctness' issue; where
the application must deal with stale events.  The validation function
is equally lightweight and doesn't (IMO) cause a performance problem.
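
In pseudo-C the dequeue loop is roughly this (a sketch reconstructed from
the steps above, not the actual FreeBSD source; everything except the
EV_* flags is illustrative):

    while (nret < maxevents && (ev = queue_first(kq)) != NULL) {
            queue_remove(kq, ev);                      /* step 1           */
            if (!(ev->flags & EV_ONESHOT) &&
                !filter_still_true(ev))                /* step 2: validate */
                    continue;                          /* stale, not returned */
            out[nret++] = ev->kevent;                  /* step 3           */
            if (ev->flags & EV_ONESHOT)
                    destroy_event(ev);                 /* wait only once   */
            else if (!(ev->flags & EV_CLEAR))
                    queue_insert_tail(kq, ev);         /* step 4           */
    }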


>> ... the "re-bind()" approach works very simply, and means that the
>> overhead of testing whether the event is still active is not a generic
>> thing that _always_ has to be done, but something where the application
>> can basically give the kernel the information that "this time we're
>> leaving the event possibly half-done, please re-test now".
>
>Hmm.  I don't like the extra system call, though.  Any chance you'd be
>willing to make get_events() take a vector of bind requests, so we can
>avoid the system call overhead of re-binding?  (Or is that too close
>to kqueue for you?)

IMO, I'd think that the calls should be orthogonal.  If the "get_events()"
call returns an array, why shouldn't the "bind_request()" call as well?
Otherwise you're only amortizing the system calls in one direction.


>And are you sure apps will always know whether they need to rebind?
>Sometimes you're faced with a protocol stack which may or may not
>read the requests fully, and which you aren't allowed to change.
>It'd be nice to still have a high-performance interface that can deal with 
>that situation.

Agreed.
--
Jonathan



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Dan Kegel

Linus Torvalds wrote:
> However, we also need to remember what got us to this discussion in the
> first place. One of the reasons why poll() is such a piggy interface is
> exactly because it tries to be "nice" to the programmer.

poll() is a piggy interface because it is O(n) in polled file 
descriptors, not because of its level-triggered nature.

> I'd much rather have an event interface that is documented to be edge-
> triggered and is really _lightweight_, than have another interface that
> starts out with some piggy features.

Agreed (except for that 'edge-triggered' part), but I don't think
'level-triggered' implies piggy.   I haven't benchmarked whether
kqueue() slows down the networking layer of FreeBSD yet; do you
suspect maintaining the level-triggered structures actually is
a bottleneck for them?
 
> I do understand that level to some degree is "nicer", but everybody pretty
> much agrees that apart from requiring more care, edge-triggered certainly
> does everything the level-triggered interfaces do.

Agreed.
 
> For example, if you want to only partially handle an event (one of the
> more valid arguments I've seen, although I do not agree with it actually
> being all that common or important thing to do), the edge-triggered
> interface _does_ allow for that. It's fairly easy, in fact: you just
> re-bind the thing.
> 
> (My suggested "bind()" interface would be to just always add a newly bound
> fd to the event queue, but I would not object to a "one-time test for
> active" kind of "bind()" interface either - which would basically allow
> for a poll() interface - and the existing Linux internal "poll()"
> implementation actually already allows for exactly this so it wouldn't
> even add any complexity).
> ... the "re-bind()" approach works very simply, and means that the
> overhead of testing whether the event is still active is not a generic
> thing that _always_ has to be done, but something where the application
> can basically give the kernel the information that "this time we're
> leaving the event possibly half-done, please re-test now".

Hmm.  I don't like the extra system call, though.  Any chance you'd be
willing to make get_events() take a vector of bind requests, so we can
avoid the system call overhead of re-binding?  (Or is that too close
to kqueue for you?)
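
Something along these lines is what I have in mind (purely hypothetical,
just to make the question concrete; 'struct event' is whatever structure
get_events() returns):

    struct bind_req {
            int    fd;
            short  events;        /* POLLIN, POLLOUT, ...       */
            void  *opaque;        /* handed back with the event */
    };

    /* Apply nbind (re)bind requests, then wait for up to maxevents
     * events or until the timeout expires. */
    int get_events(struct bind_req *binds, int nbind,
                   struct event *out, int maxevents,
                   struct timeval *timeout);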

And are you sure apps will always know whether they need to rebind?
Sometimes you're faced with a protocol stack which may or may not
read the requests fully, and which you aren't allowed to change.
It'd be nice to still have a high-performance interface that can deal with 
that situation.
- Dan



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Dan Kegel

Jim Gettys wrote:
> So I want an interface in which I can get as many events as possible
> at once, and one in which the events themselves can have appropriate
> aggregation behavior.  It isn't quite clear to me if the proposed interface
> would have this property.

I believe get_event, /dev/poll, and kqueue all share this property.  
e.g. none of them will present multiple POLLIN events per fd per call.  
Is that what you meant?

- Dan



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Jim Gettys


Note that there is another aspect to the efficiency / performance of the 
select/poll style of interfaces not immediately obvious, but which occurs 
as a result of how some (streaming/batching) protocols work.

An X server does not call select all that often (the X server is probably one
of the two most frequently used programs that care about this, though I
believe getting the Apache case right is more important).

X is such a streaming protocol: it is a feature that I don't have to do 
extra reads or system calls to deal with more data arriving from a client.  
An X server doesn't want one event generated for each incoming TCP segment: 
it merely needs to know there is data available on a file descriptor as 
a binary condition.  I really don't want to have to do one operation per 
segment; this is less efficient than the current situation.

Similarly, it is a feature that with one call I can find out that there
is work to do on multiple file descriptors.

In short, the X server does a select, and then loops across all the file
descriptors with work to do before doing another select: the system call
overhead gets amortized across multiple clients and buffers received from
the client.  As the server gets busier, it is more and more likely
that there is more than one client with work to do, and/or multiple
TCP segments have arrived to process (in the common single client
is busy case).  So we make the system call less and less often
as a fraction of work done.
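
Schematically, the pattern is (a sketch, not actual X server code;
client_fds, maxfd and process_all_buffered_requests() are placeholders):

    fd_set ready;
    int fd, n;

    for (;;) {
            ready = client_fds;                  /* copy the interest set */
            n = select(maxfd + 1, &ready, NULL, NULL, NULL);
            for (fd = 0; fd <= maxfd && n > 0; fd++) {
                    if (!FD_ISSET(fd, &ready))
                            continue;
                    n--;
                    process_all_buffered_requests(fd);  /* drain everything */
            }
    }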

This has the happy consequence that the select-caused overhead DROPS as
a fraction of total work as the X server gets busier, and X is most efficient
at the point in time you care the most: when you have the most work to
do.  The system call is returning more information each time it is called,
and some of that information is aggregated as well (additional data arriving).
It doesn't practically matter how efficient the X server is when
you aren't busy, after all.

This aggregation property is therefore important, and there needs to be
some way to achieve this, IMHO.

Web servers often have similar behavior, though since most current
HTTP clients don't implement streaming behavior, the benefit is currently
much lower (would that HTTP clients start driving HTTP servers the
way the HTTP/1.1 protocol allows...  Sigh...).  Right now, scaling
to large numbers of descriptors is most urgent for big web servers.

So I want an interface in which I can get as many events as possible
at once, and one in which the events themselves can have appropriate
aggregation behavior.  It isn't quite clear to me if the proposed interface
would have this property.

As I said in early talks about X: "X is an exercise in avoiding system
calls"

- Jim Gettys
--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
[EMAIL PROTECTED]




Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Dan Kegel

"Eric W. Biederman" wrote:
> 
> Dan Kegel <[EMAIL PROTECTED]> writes:
> > It's harder to write correct programs that use edge-triggered events.
> 
> Huh? The race between when an event is reported, and when you take action
> on it effectively means all events are edge triggered.

Nope.  With any of these interfaces, whether level or edge, the app must
treat all the events as hints, and be prepared for them to be wrong.
That's why code that uses poll() tends to use nonblocking sockets.
No matter what you do, you can't get away from that.  Consider
accepting a connection.  Poll or whatever tells you a listening socket
is ready for you to call accept().  Before you do, the other end sends
an RST.  Consequence: app must be prepared for a readiness event to be wrong.
cf. http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0616.html
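
In code, the defensive pattern is simply this (a sketch; the listening
socket is assumed to be non-blocking, and handle_new_connection() is a
placeholder):

    #include <sys/socket.h>
    #include <errno.h>

    void handle_new_connection(int fd);          /* placeholder */

    void on_listen_readable(int listen_fd)
    {
            int fd = accept(listen_fd, NULL, NULL);

            if (fd < 0) {
                    if (errno == EAGAIN || errno == EWOULDBLOCK ||
                        errno == ECONNABORTED)
                            return;  /* stale hint, e.g. the peer sent an RST */
                    return;          /* anything else is a real error */
            }
            handle_new_connection(fd);
    }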

This is orthogonal to the question of whether edge or level triggering
is easier to write code for, I think.
 
> So making the interface clearly edge triggered seems to be a win for
> correctness.

With level-triggered interfaces like poll(), your chances
of writing a correctly functioning program are higher because you
can throw events away (on purpose or accidentally) with no consequences; 
the next time around the loop, poll() will happily tell you the current
state of all the fd's.

With edge-triggered interfaces, the programmer must be much more careful
to avoid ever dropping a single event; if he does, he's screwed.
This gives rise to complicated code in apps to remember the current
state of fd's in those cases where the programmer has to drop events.
And there are many cases like that; see 
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0529.html and
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0592.html
for examples.
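
The difference in the inner loops is roughly this (sketches only; the
consume_*() calls are placeholders):

    /* Level-triggered: dropping work is harmless; the next poll()
     * reports the still-ready fd again. */
    poll(pfds, nfds, -1);
    for (i = 0; i < nfds; i++)
            if (pfds[i].revents & POLLIN)
                    consume_some(pfds[i].fd);    /* ok to stop early */

    /* Edge-triggered: the notification comes only once, so the app must
     * drain the fd (or remember that it didn't) before waiting again. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
            consume_bytes(buf, n);
    /* n < 0 && errno == EAGAIN: now it is safe to wait for the next edge */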

Better to let apps ask the kernel for the current state of each socket if
they want, is what I say.  This reduces the amount of code and effort needed
to write many apps, *and makes migrating legacy apps to high-performance
interfaces easier*.
For new apps that are willing to maintain the state 
themselves, by all means, provide an edge-oriented interface, too.
It's probably better if your code is manly enough to handle it.
- Dan



Re: Linux's implementation of poll() not scalable?

2000-10-26 Thread Eric W. Biederman

Dan Kegel <[EMAIL PROTECTED]> writes:

> It's harder to write correct programs that use edge-triggered events.

Huh? The race between when an event is reported, and when you take action 
on it effectively means all events are edge triggered. 

So making the interface clearly edge triggered seems to be a win for
correctness.

Eric



Re: Linux's implementation of poll() not scalable?

2000-10-25 Thread Dan Kegel

Helge Hafting wrote:
> > With poll(), it was *not a bug* for the user code to drop events; with
> > your proposed interface, it *is a bug* for the user code to drop events.
> > I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted
> > incorrectly that your interface "has the same semantics as poll from
> > the event perspective".
> 
> Don't worry - existing interfaces will be around for existing
> applications.  I believe Linus is trying to engineer a new and better
> interface that will make a difference - (server) programs written to use
> the new stuff will show better performance and use up less kernel time.
>
> The semantics will be different, so only new programs will benefit.
> However, the benefits may be so great that people will want to rewrite 
> programs to get them.

It's harder to write correct programs that use edge-triggered events.
Level-triggered events are inherently easier to use.  If our high-performance
interface only supports edge-triggered events, people who write high
performance apps may well migrate to FreeBSD, where there is a lovely
high performance interface that offers either level or edge triggered
events, whichever you prefer.

> Seems Linus doesn't want to do this no matter what you come up with.  You
> are free to try of course...

Or free to switch to FreeBSD, if that's the only OS which has a
high-performance poll replacement that I can get my existing codebase
to work with.

- Dan



Re: Linux's implementation of poll() not scalable?

2000-10-25 Thread Simon Kirby

On Tue, Oct 24, 2000 at 04:12:38PM -0700, Dan Kegel wrote:

> With poll(), it was *not a bug* for the user code to drop events; with
> your proposed interface, it *is a bug* for the user code to drop events.
> I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted
> incorrectly that your interface "has the same semantics as poll from
> the event perspective".

I missed this because I've never written anything that drops or forgets
events and didn't think about it.  Most programs will read() until EOF is
returned and write() until EAGAIN is returned with non-blocking sockets.
Is there any reason to ignore events other than to slow down response to
some events in favor of others?

I don't see why this is a problem as this interface _isn't_ replacing
select or poll, so it shouldn't matter for existing programs that aren't
converted to use the new interface.

In any case, I think I would prefer that the kernel be optimized for the
common case and leave any strange processing up to userspace so that the
majority of programs which don't need this special case can run as fast
as possible.  Besides, it wouldn't be difficult for a program to stack up
a list of events, even in the same structure as it would get from the
kernel, so that it can process them later.  At least then this data would
be in swappable memory.  Heck, even from an efficiency perspective, it
would be faster for userspace to store the data as it wouldn't keep
getting it returned from a syscall each time...
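
E.g. something as simple as this would do (a sketch; 'struct event'
stands for whatever structure the kernel hands back):

    #define MAX_DEFERRED 1024                    /* arbitrary */

    static struct event deferred[MAX_DEFERRED];  /* plain, swappable memory */
    static int ndeferred;

    static void defer_event(const struct event *ev)
    {
            if (ndeferred < MAX_DEFERRED)
                    deferred[ndeferred++] = *ev; /* process it later */
    }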

Am I missing something else?

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]



Re: Linux's implementation of poll() not scalable?

2000-10-25 Thread Kai Henningsen

[EMAIL PROTECTED] (Linus Torvalds)  wrote on 23.10.00 in 
<[EMAIL PROTECTED]>:

> actually inform about the events. The way to do this simply is to limit it
> in very clear ways, the most notable one being simply that there is only
> one event queue per process (or rather, per "struct files_struct" - so
> threads would automatically share the event queue). This keeps the

While the rest looks fine, I suspect this one is a show-stopper.

IMO, the ability to wait for separate event sets in separate threads is a  
MUST. In a single-threaded program, it would probably still be useful in a  
number of situations.

Not that it's all that difficult to expand your model to do multiple  
queues.

MfG Kai



Re: Linux's implementation of poll() not scalable?

2000-10-25 Thread Gideon Glass

> 
> 
> 
> On Tue, 24 Oct 2000, Evan Jeffrey wrote:
> > 
> > > Multiple event queues are bad, because it completely breaks the notion of
> > > event-driven programming. How do you want to listen to them all? You can't.
> > > You can only listen to one event queue at a time - unless you create some
> > 
> > You can listen to one event queue per thread.
> 
> Oh, I agree.
> 
> And I think something like CLONE_EVENTS would be fine - and decide
> yourself what kind of threads you want (do you want indistinguishable
> "anonymous" threads like apache, or do you want a threads that handle
> separate event queues). Or you might have a mixture of the two - for web
> serving the "accept()" event list might be a separate thing with a few
> threads doing that, while "worker threads" handle the actual IO..

All this is fine; but why force people to break up their apps
into threads?

 [ from an earlier message ]

> Trust me. Multiple event queues are idiotic. Anybody who _designs_ for
> multiple event queues has some serious problems.

So, are you now saying that "serious problems" are no longer
to be had, provided that an app is chopped into threads?

If you have one queue per process, multiple queues could
be implemented in a straightforward manner.  Add a "queue"
fd type and a system call to create one.  Allow any queuable
fd (such as a socket) to be queued on the one implicit queue
and also to be associated with zero or one explicitly-created queues.
This would satisfy library or other code that wanted for whatever
reason to have visibility into all fds with pending events.
It would also give apps the ability to break up work amongst
disjoint portions without forcing the use of multiple threads.
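
In other words, something shaped like this (purely hypothetical calls,
nothing of the sort exists today):

    int event_queue_create(int flags);            /* returns a "queue" fd    */
    int event_queue_bind(int qfd, int fd,         /* associate fd with qfd   */
                         short events, void *opaque);
    int event_queue_wait(int qfd,                 /* wait on just this queue */
                         struct event *out, int maxevents,
                         struct timeval *timeout);

    /* fds never bound to an explicit queue would still be reported on the
     * one implicit per-process queue, as in the original proposal. */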

This seems reasonable to implement.  Or am I missing something?

gid




Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Edgar Toernig

Linus Torvalds wrote:
> 
> The point they disagree is when the event gets removed from the event
> queue. For edge triggered, this one is trivial: when a get_events() thing
> happens and moves it into user land. This is basically a one-liner, and it
> is local to get_events() and needs absolutely no help from anybody else.
> So obviously event removal is _very_ simple for edge-triggered events -
> the INTACK basically removes the event (and also re-arms the trigger
> logic: which is different from most interrupt controllers, so the analogy
> falls down here).

And IMHO here's a problem.  The events are no longer events.  They are
just hints saying: after the previous get_events() something has happened.
You don't know if you've already handled this event.  There's no
synchronization between what the app does and the triggering of 'hints'.

For example your waitpid-loop: you get the event, start the waitpid-loop.
While processing another process dies.  You handle it too (still in the
loop).  But a new 'hint' has already been registered.  So on the next
get_event you'll be notified again.  I just hope every event-generator
has a WNOHANG flag...
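
For what it's worth, the WNOHANG-style consumer this implies is a short
loop; a stale hint just makes it return immediately (a sketch, not from
the original mail):

#include <sys/types.h>
#include <sys/wait.h>

static void reap_children(void)
{
        int status;
        pid_t pid;

        /* Reap until nothing is left; an already-handled hint is harmless. */
        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
                /* ... record the exit status for pid ... */
        }
}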

It could even be possible that you are unable to perform some actions
without triggering hints, despite the fact that the conditions will
already be gone before the next get_event.  This may generate a lot of
bogus hints.

At least the current semantics of, for example, "POLL_IN on fd was signaled
so I may read without blocking" get lost.

Maybe (don't know kernel wise) it makes sense to check in the kernel
if the events to be returned to userspace are still valid.  The user
space has to do it anyway...  But that way you get a more level-based
design ;)


Another thing: while toying with cooperative userspace multithreading
I found it much more versatile to have a req_type/req_data tuple in
the request structure (ie READ/, TIMEOUT/, WAKEUP/).

Ciao, ET.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mark Montague

Linus Torvalds <[EMAIL PROTECTED]> writes:

>   bind_event(sock, POLLIN, NULL, accept_fn);

[...]

> (In fact, you might as well move the event array completely inside
> "get_event()", because nobody would be supposed to look at the raw array
> any more. So the "get_event()" interface would be even simpler: just a
> timeout, nothing more. Talk about simple programming.

It might be good to have bind_event return any previous accept_fn
that's there, in case a library or something wants to chain a handler
rather than clobbering what's there. Or maybe have a way to query,
although this seems fraught with danger if another thread binds
between a query and a bind.
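
A sketch of what that chaining could look like in user space, assuming a
hypothetical bind_event() variant that returns the handler it displaces
(the proposal as written does not):

typedef void (*event_fn_t)(int fd, unsigned long mask, void *opaque);

/* hypothetical: install fn, return whatever handler was installed before */
event_fn_t bind_event_chain(int fd, unsigned long mask,
                            void *opaque, event_fn_t fn);

static event_fn_t prev_fn;

static void my_handler(int fd, unsigned long mask, void *opaque)
{
        /* ... this library's own processing ... */
        if (prev_fn)
                prev_fn(fd, mask, opaque);      /* pass it down the chain */
}

static void install(int fd, unsigned long mask)
{
        prev_fn = bind_event_chain(fd, mask, 0, my_handler);
}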

- M

-- 
Mark "Monty" Montague | [EMAIL PROTECTED]  | I don't do Windows(tm)
I'm dubious about any company whose assets can be destroyed by rm -rf
  <http://www.gg.caltech.edu/~monty/monty.shtml>
 X-PGP-Fingerprint: E4 EA 6D B1 82 46 DB A1  B0 FF 60 B9 F9 5D 5C F7
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:Linus Torvalds <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> Oh, I agree.
> 
> And I think something like CLONE_EVENTS would be fine - and decide
> yourself what kind of threads you want (do you want indistinguishable
> "anonymous" threads like apache, or do you want a threads that handle
> separate event queues). Or you might have a mixture of the two - for web
> serving the "accept()" event list might be a separate thing with a few
> threads doing that, while "worker threads" handle the actual IO..
> 
> But if you want to have some event queue ID, I just wonder how you'd set
> it up sanely without tons of complications..
> 

How about handling the event queue ID as an event mask internally?

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
> > But user code currently written for poll() has the luxury of dropping
> > events because poll() will happily report on the current readiness of
> > the socket every time.  /dev/poll is level-triggered because it's trying
> > to make conversion of poll()-based code easy.  With your scheme,
> > whatever user code is receiving the edges better darn well do something
> > about them, because it's only going to get them once.
> 
> Oh, I agree. I'm not saying that my approach magically fixes bugs in user
> space ;)

With poll(), it was *not a bug* for the user code to drop events; with
your proposed interface, it *is a bug* for the user code to drop events.
I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted
incorrectly that your interface "has the same semantics as poll from
the event perspective".

[ Linus explains why the level-triggered style is difficult ]

>  - the "proactively remove events when the thing that triggerred them goes
>away" approach means that each anti-event (like a read that empties the
>buffer) needs to undo it's events, but it also needs to be careful that
>it doesn't undo combined events, and it needs to be very careful about
>races (new packet coming in), so the proactive remove actually ends up
>being less than trivial - and in a performance-critical section.

That's the approach I was thinking of.  It should be about as difficult
as continuously maintaining an integer for each fd which matches what poll() 
with a zero timeout on that fd would return.  (The difference is that
as the bits of the integer change, you would move the fd's record in and out of
the ready list.)  How hard is that?  

> Implement it, and see. I bet you'll find that it gets really interesting
> when you have concurrent accesses to the same file descriptor etc.

Sure, but lots of the kernel is really interesting.
- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Evan Jeffrey wrote:
> 
> > Multiple event queues are bad, because it completely breaks the notion of
> > event-driven programming. How do you want to listen to them all? You can't.
> > You can only listen to one event queue at a time - unless you create some
> 
> You can listen to one event queue per thread.

Oh, I agree.

And I think something like CLONE_EVENTS would be fine - and decide
yourself what kind of threads you want (do you want indistinguishable
"anonymous" threads like apache, or do you want a threads that handle
separate event queues). Or you might have a mixture of the two - for web
serving the "accept()" event list might be a separate thing with a few
threads doing that, while "worker threads" handle the actual IO..

But if you want to have some event queue ID, I just wonder how you'd set
it up sanely without tons of complications..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Evan Jeffrey


> Multiple event queues are bad, because it completely breaks the notion of
> event-driven programming. How do you want to listen to them all? You can't.
> You can only listen to one event queue at a time - unless you create some

You can listen to one event queue per thread.  Maybe in the case people
always bring up (Apache handling 10,000 concurrent connections), threads
are indistinguishable, but for many applications different threads or thread
groups will be assigned to different tasks.   An ORB ought to be allowed
to clone off an event handler thread that doesn't have to worry about
receiving X events, which might be really slow to handle.  (Or vice versa,
handling a mouse_move event shouldn't have to wait for a CORBA upcall that
recalculates a spreadsheet).

Also, what happens if someone does a clone with CLONE_FILES but no CLONE_VM?
Now the event handlers and opaque data point to things that may not even be
valid in this VM.

I see a couple of ways to do this that don't give up much from the "One
event handler list per process" idea.  Add a CLONE_EVENTS flag to clone.  This
is rather nasty, since it violates orthogonality (since a struct event
references both VM and fd objects), and makes it difficult for pthreads users
to choose whether to share event handlers.  The second is to tag events
with a threadid, and allow add_event to specify a flag saying "only deliver
to this thread" and get_events to say "wait for this threads events, or any
threads events".  The third is to just declare "event handlers are shared
by any processes sharing both a VM and FILES".  This seems rather silly
(since it would still have to be independant in the task_struct, but with
no user space control over it), and doensn't help the poor SOB who has
to handle 

> kind of "hierarchy" of event queues, where you have one event queue that
> you wait on that contains the _other_ event queues, so that you can tell
> which one has anything pending..

Agreed.  The only reason to do this is if your event queue sucks as badly
as poll() and you have to do so for efficiency reasons... (Like people do
with 10 threads each handling 1/10 of the fds).

Evan
---
Fear is the mind killer.   -- Frank Herbert, "Dune"
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Maas

> Shouldn't there also be a way to add non-filedescriptor based events
> into this, such as "child exited" or "signal caught" or shm things?

Waiting on pthreads condition variables, POSIX message queues, and
semaphores (as well as fd's) at the same time would *rock*...

Unifying all these "waitable objects" would be tremendously helpful to fully
exploit the "library transparency" advantage that Linus brought up. Some
libraries might want to wait on things that are not file descriptors...

Regards,
Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Miquel van Smoorenburg

In article <[EMAIL PROTECTED]>,
Linus Torvalds  <[EMAIL PROTECTED]> wrote:
>   struct event {
>   unsigned long id;   /* file descriptor ID the event is on */
>   unsigned long event;/* bitmask of active events */
>   };
>   int bind_event(int fd, struct event *event);

Shouldn't there also be a way to add non-filedescriptor based events
into this, such as "child exited" or "signal caught" or shm things?

Or should there simply be ways to make those events available
through filedescriptors anyway? Then you'd get /dev/events - open it, tell
it what kind of event you want to bind to the fd, pass it to bind_event,
read a struct event_info from it when there's something to read.

Just a random thought. Ignore if not appropriate.

Mike.
-- 
Q: How many ears do Trekkies have?
A: Three; the left ear, the right ear, and the final front ear.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mitchell Blank Jr

Linus Torvalds wrote:
> >   * it doesn't add extra syscalls
> 
> Sure it does. 
> 
> What do you think ioctl's are?

As I explained a few lines down from where you stopped quoting (and probably
stopped reading) the ioctl() use is just an artifact of Solaris's icky
implementation.  It could and should use read().  Not an extra syscall.

> Sure, you can have multiple "poll lists" by opening /dev/poll multiple
> times, and you seem to see this as a great feature. I'm telling you that
> multiple event queues are a horrible disaster, and it's a perfect example
> of what an engineer with the "more is better" mindset comes up with.

It has nothing to do with how exciting of a feature it is.  It's a matter
"hey, if you implement it with /dev/poll, you get multiple queues FOR
FREE".  Those sorts of things are an indication that you've got the
implementation right.  The fact that in your scheme multiple queues would
be extra work is just residue from the fact that you've got the idea wrong.

And how do you propose dealing with multi-threaded programs?  Obviously for
SMP you'd want multiple threads to be able to grab events, so I guess the
"one great queue" will be shared among all the threads.  But, suppose I
want to dedicate a thread to coping with a specific set of events.  Hmm...
guess you'll need a CLONE_EVENTQUEUE flag to support that.  With /dev/poll
none of this foolishness is needed - you just open two of them and let
the threads do what they want.

> Multiple event queues are bad, because it completely breaks the notion of
> event-driven programming. How do you want to listen to them all? You can't.

Explained in detail in my mail.

> You can only listen to one event queue at a time - unless you create some
> kind of "hierarchy" of event queues, where you have one event queue that
> you wait on that contains the _other_ event queues, so that you can tell
> which one has anything pending..

No, event queues of event queues are messy in implementation.  There's
nothing fundamentally wrong with them, but you need to make sure the
graph stays acyclic which is just an implementation hassle.  Not enough
gain for too much code (unless someone has a beautiful algorithm)

However, it's perfectly reasonable to use select() or poll() on two event
queues.  As I explained in my email, this is needed if you're already using
a library that does its own event queue based on poll or select.

> Basically, tastes differ. Everything you seem to see as an advantage of
> /dev/poll, I see as a horrible kludge.

Well, everything I see indicates that /dev/poll would be the easier of
the two interfaces to actually implement.  All the added flexibility is
just gravy.

-Mitch
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Abramo Bagnara wrote:

> Linus Torvalds wrote:
> > 
> > 
> > struct event {
> > int fd;
> > unsigned long mask;
> > void *opaque;
> > void (*event_fn)(int fd, unsigned long mask, void *opaque);
> 
> My experience say that:
> 
>   unsigned long rmask;
>   void (*event_fn)(struct event *event);
> 
> is a far better solution (more type safe, more descriptive).

You're probably right. It certainly makes it easier to add fields later on
if we'd want to extend the ID part without having to change old users..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Dan Kegel wrote:
> 
> But user code currently written for poll() has the luxury of dropping
> events because poll() will happily report on the current readiness of
> the socket every time.  /dev/poll is level-triggered because it's trying
> to make conversion of poll()-based code easy.  With your scheme, 
> whatever user code is receiving the edges better darn well do something
> about them, because it's only going to get them once.

Oh, I agree. I'm not saying that my approach magically fixes bugs in user
space ;)

> > The BSD kevent paper goes on about "level and edge triggered" and it
> > becomes a big thing for them, and they selected level-triggered events as
> > if it made any difference. And sure - it _does_ make a difference, but the
> > only difference is in how hard it is to implement, and level-triggered is
> noticeably harder.
> 
> I don't see why edge triggered is that much harder.  All it adds is
   level
> a layer which receives the edges and moves fds back and forth between
> a 'ready' list and a 'not ready' list.  Easy as pie.

Not true.

For example, if you're truly level-triggered, and you have a socket that
gets data, the event moves onto the event queue. So far so fine: both edge
and level agree about this one.

The point they disagree is when the event gets removed from the event
queue. For edge triggered, this one is trivial: when a get_events() thing
happens and moves it into user land. This is basically a one-liner, and it
is local to get_events() and needs absolutely no help from anybody else.
So obviously event removal is _very_ simple for edge-triggered events -
the INTACK basically removes the event (and also re-arms the trigger
logic: which is different from most interrupt controllers, so the analogy 
falls down here).

For level, the thing is not as easy at ALL. Suddenly removal becomes a big
issue, and needs help from the actual driver. You can do it two ways:
calling down to the driver when you remove (to see if the event should be
dismissed or not once it has been read) or have the driver pro-actively
remove the event whenever a read() happens (or whatever it is that undoes the
event).

Both are actually fairly hard. Much harder than they sound. For different
reasons.

 - the callback approach at get_events() time sounds trivial, but actually
   has two problems: cache footprint for "get_events()" grows a _lot_
   (because the events are likely to be spread out a lot if there are a
   lot of them pending, so you don't get a nice tight inner loop at all),
   and you get "double events" - by the time the event first happens, it
   will still be active, so we cannot actually remove it at that time
   (there is still data to be read - and the event doesn't go away until
   we read it) so we'll get the event _again_, and on the next
   get_events() it will notice that the event was bogus, and remove it
   (and we can optimize it away from reporting it to user land at that
   point, so only the kernel needs to look at it twice and do two
   callbacks)

 - the "proactively remove events when the thing that triggerred them goes
   away" approach means that each anti-event (like a read that empties the
   buffer) needs to undo its events, but it also needs to be careful that
   it doesn't undo combined events, and it needs to be very careful about
   races (new packet coming in), so the proactive remove actually ends up
   being less than trivial - and in a performance-critical section.

Now, compare that to a one-liner that just does a "list_del(&event->list)"
as it copies over the event to user mode. Wouldn't you say that the
edge-triggered version is simpler?

> > The reason "edge-triggered" (ie only when an event changes) is preferable
> > is that it's MUCH simpler, and means that the "get_events()" system call
> > doesn't need to actually understand what the events mean at all. 
> 
> Not much understanding is required on the part of the edge-to-level filter.

Implement it, and see. I bet you'll find that it gets really interesting
when you have concurrent accesses to the same file descriptor etc.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
> > * Do you get an event whenever an fd is ready for something, or
> >   only when its readiness changes?  (Presumably whenever an fd is ready for
> >   something?)
> 
> Only when its readiness changes - probably with the addition that it would
> simplify things that a new event always (unconditionally) gets added to
> the event queue.
>
> Note that there are no races in this implementation - there is no way for
> events to get lost, even trivially seen in SMP/threaded environments.
> That's important.

But user code currently written for poll() has the luxury of dropping
events because poll() will happily report on the current readiness of
the socket every time.  /dev/poll is level-triggered because it's trying
to make conversion of poll()-based code easy.  With your scheme, 
whatever user code is receiving the edges better darn well do something
about them, because it's only going to get them once.
 
> The BSD kevent paper goes on about "level and edge triggered" and it
> becomes a big thing for them, and they selected level-triggered events as
> if it made any difference. And sure - it _does_ make a difference, but the
> only difference is in how hard it is to implement, and level-triggered is
> noticeably harder.

I don't see why edge triggered is that much harder.  All it adds is
a layer which receives the edges and moves fds back and forth between
a 'ready' list and a 'not ready' list.  Easy as pie.

> The reason "edge-triggered" (ie only when an event changes) is preferable
> is that it's MUCH simpler, and means that the "get_events()" system call
> doesn't need to actually understand what the events mean at all. 

Not much understanding is required on the part of the edge-to-level filter.

That said, the edge-to-level filter can be in userspace, and the kernel
part can be just edge-triggered -- as long as the kernel delivers the
downward edges as well as the upward edges.   We could do that by
adding NOTPOLLIN, NOTPOLLOUT, etc., i.e. events that are delivered
when a socket becomes UN-ready for reading.
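
A rough sketch of such a user-space edge-to-level layer, assuming the
kernel reported rising edges (POLLIN/POLLOUT) plus the hypothetical
falling edges (NOTPOLLIN/NOTPOLLOUT) suggested above; the table size,
constants and struct layout are made up for illustration, and bounds
checking is omitted:

#include <poll.h>
#include <stddef.h>

#define MAXFD           65536
#define NOTPOLLIN       0x10000         /* hypothetical "became unreadable" */
#define NOTPOLLOUT      0x20000         /* hypothetical "became unwritable" */

struct fdstate {
        int fd;
        unsigned level;                 /* current POLLIN|POLLOUT level      */
        struct fdstate *next, *prev;    /* ready-list links (NULL = unready) */
};

static struct fdstate fds[MAXFD];
static struct fdstate ready = { .next = &ready, .prev = &ready };

static void ready_del(struct fdstate *f)
{
        f->prev->next = f->next;
        f->next->prev = f->prev;
        f->next = f->prev = NULL;
}

static void ready_add(struct fdstate *f)
{
        f->next = ready.next;
        f->prev = &ready;
        ready.next->prev = f;
        ready.next = f;
}

/* Feed one edge from the kernel into the level-triggered view. */
void edge_to_level(int fd, unsigned edge_mask)
{
        struct fdstate *f = &fds[fd];
        unsigned old = f->level;

        f->fd = fd;
        f->level |= edge_mask & (POLLIN | POLLOUT);     /* rising edges  */
        if (edge_mask & NOTPOLLIN)
                f->level &= ~POLLIN;                    /* falling edges */
        if (edge_mask & NOTPOLLOUT)
                f->level &= ~POLLOUT;

        if (!old && f->level)
                ready_add(f);           /* became ready: onto the list   */
        else if (old && !f->level)
                ready_del(f);           /* no longer ready: off the list */
}

/* A level-triggered poll() equivalent then just walks the ready list. */
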
- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Lee Chin

There is only one thing I don't understand about this... why can't we
re-implement the poll() implementation of Linux instead of introducing
another system call?

If I understood Linus correctly, what he is saying is that the bind_event
system call is needed to give the kernel a hint that the application is
interested in a certain event associated with a file descriptor.

If the kernel kept such an event queue per process anyway (as soon as the
process opened the file/socket)... then the poll implementation would be
exactly like the get_events system call.  What is wrong with this?

Thanks
Lee

On Mon, 23 Oct 2000, Linus Torvalds wrote:
>
> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queueing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown scenario where your queues get longer and longer.

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more
expensive as you have more and more events - their expense is the number
of file descriptors, not the number of events per se. In fact, both poll()
and select() tend to perform _better_ when you have pending events, as
they are both amenable to optimizations when there is no need for waiting,
and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signals kind of approach is really good for the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information is good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

struct event {
unsigned long id; /* file descriptor ID the event is on */
unsigned long event; /* bitmask of active events */
};

int get_events(struct event * event_array, int maxnr, struct timeval
*tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic to various drivers etc to
actually inform about the events. The way to do this simply is to limit it
in very clear ways, the most notable one being simply that there is only
one event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue ID's to pass around etc.
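
To make that concrete, a sketch of how a server loop might sit on top of
the interface quoted above; bind_event()/get_events() are the proposed
(hypothetical) calls, and the rest is purely illustrative:

#include <poll.h>
#include <sys/time.h>

struct event {
        unsigned long id;       /* file descriptor the event is on */
        unsigned long event;    /* bitmask of active events        */
};

/* the two proposed system calls (they do not exist) */
int bind_event(int fd, struct event *event);
int get_events(struct event *event_array, int maxnr, struct timeval *tmout);

#define MAXEV 128

void serve(int listen_fd)
{
        struct event ev = { listen_fd, POLLIN };

        bind_event(listen_fd, &ev);     /* interest is sticky, bound once */

        for (;;) {
                struct event ev_list[MAXEV];
                struct timeval tmout = { 1, 0 };
                int i, n = get_events(ev_list, MAXEV, &tmout);

                for (i = 0; i < n; i++) {
                        int fd = (int)ev_list[i].id;

                        if (fd == listen_fd) {
                                /* accept() and bind_event() the new socket */
                        } else {
                                /* read()/write() until EAGAIN: the report
                                 * is an edge, not a level */
                        }
                }
        }
}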

Implementation is straightforward: the event queue basically consists of

- a queue head in "struct files_struct", initially empty.

- doing a "bind_event()" basically adds a fasync entry to the file
structure, but rather than cause a signal, it just looks 

Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Abramo Bagnara

Linus Torvalds wrote:
> 
> 
> struct event {
> int fd;
> unsigned long mask;
> void *opaque;
> void (*event_fn)(int fd, unsigned long mask, void *opaque);

My experience say that:

unsigned long rmask;
void (*event_fn)(struct event *event);

is a far better solution (more type safe, more descriptive).


-- 
Abramo Bagnara   mailto:[EMAIL PROTECTED]

Opera Unica
Via Emilia Interna, 140  Phone: +39.0546.656023
48014 Castel Bolognese (RA) - Italy  Fax:   +39.0546.656023

ALSA project ishttp://www.alsa-project.org
sponsored by SuSE Linuxhttp://www.suse.com

It sounds good!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Tue, Oct 24, 2000 at 10:03:04AM -0700, Linus Torvalds wrote:

> Basically, with get_events(), there is a maximum of one event per "bind".
> And the memory for that is statically allocated at bind_event() time. 
>... 
> But you'd be doing so in a controlled manner: the memory use wouldn't go
> up just because there is a sudden influx of 5 packets. So it scales
> with load by virtue of simply not _caring_ about the load - it only cares
> about the number of fd's you're waiting on.

Nice.  I like this.

It would be easy for existing userspace code to start using this
interface as it has the same semantics as select/poll from the event
perspective.  But it would make things even easier, as the bind would
follow the life of the descriptor and thus wouldn't need to be "requeued"
before every get_events call, so that part of userspace code could just
be ripped out^W^W disabled and kept only for portability.

In most of the daemons I have written, I've ended up using memcpy() to
keep a non-scribbled-over copy of the fdsets around so I don't have to
walk data structures and requeue fds on every loop for select()...nasty.

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Simon Kirby wrote:
> 
> However, isn't there already something like this, albeit maybe without
> the ability to return multiple events at a time?  When discussing
> select/poll on IRC a while ago with sct, sct said:
> 
>   Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
>SIGIO as usual.

[ siginfo and RT signals, but using "sigwaitinfo()" instead of actual
  signal handlers ]

The problem with RT signals (whether you use sigwaitinfo() or signal
handling overhead) is that they don't scale to lots of events. The memory
pressure for keeping track of millions of siginfo structures is more than
any system can handle - so all RT signals implementations will have to
eventually fall back on something else.

Which makes for really hard to debug bugs, and complex programming (you
essentially end up having to duplicate all of the event handling, and the
"fallback" case has the added difficulty that you won't even see it except
under heavy load). 

> Also, what is different in your above interface that prevents it from
> being able to queue up too many events?

Basically, with get_events(), there is a maximum of one event per "bind".
And the memory for that is statically allocated at bind_event() time. 

> I guess the structure is only
> sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
> FDs on x86, but I don't see how the other method would be significantly
> different.  The kernel would have to store the queued events still,
> surely...

Oh, the bind information would be more like 32 bytes per bind, so assuming
you have 20,000 fd's you're waiting for, you'd certainly be using a lot of
memory. Still less than the quite complex SIGIO structures, though.

But you'd be doing so in a controlled manner: the memory use wouldn't go
up just because there is a sudden influx of packets. So it scales
with load by virtue of simply not _caring_ about the load - it only cares
about the number of fd's you're waiting on.

(Obviously _other_ parts of the system will have to keep up with tons of
new packets, but that's why protocols like TCP have flow control, and
that is something that is done independently of the event notification,
so regardless of _what_ kind of event model you have you have to handle
the simple load of TCP ;).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Mon, Oct 23, 2000 at 10:39:36PM -0700, Linus Torvalds wrote:

> Actually, forget the mmap, it's not needed.
> 
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
>...
> Basically, the perfect interface for events would be
> 
>   struct event {
>   unsigned long id;   /* file descriptor ID the event is on */
>   unsigned long event;/* bitmask of active events */
>   };
> 
>   int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

I like. :)

However, isn't there already something like this, albeit maybe without
the ability to return multiple events at a time?  When discussing
select/poll on IRC a while ago with sct, sct said:

  Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
   SIGIO as usual.
  Simon: Then fcntl(fd, F_SETSIG, rtsignum)
  Simon: And you'll get a signal queue which passes you the fd of
   each SIGIO in turn.
  sct: easy :)
  Simon: You don't even need the overhead of a signal handler:
   instead of select(), you just do "sigwaitinfo(&siginfo, timeout)"
   and it will do a select-style IO wait, returning the fd in the
   siginfo when it's available.

(Captured from IRC on Nov 12th, 1998.)
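
For reference, that pattern looks roughly like this on Linux (a sketch:
F_SETSIG, si_fd and si_band are real, the RT signal number is an arbitrary
choice, and the fallback for an overflowing signal queue is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void watch_fd(int fd, int rtsig)
{
        fcntl(fd, F_SETOWN, getpid());          /* deliver the signal to us    */
        fcntl(fd, F_SETSIG, rtsig);             /* queued RT signal, not SIGIO */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK | O_ASYNC);
}

int main(void)
{
        int rtsig = SIGRTMIN + 1;
        sigset_t set;

        sigemptyset(&set);
        sigaddset(&set, rtsig);
        sigprocmask(SIG_BLOCK, &set, NULL);     /* collect via sigwaitinfo() */

        /* watch_fd(sock, rtsig) for each descriptor of interest ... */

        for (;;) {
                siginfo_t info;

                if (sigwaitinfo(&set, &info) < 0)
                        continue;
                /* info.si_fd says which descriptor, info.si_band which
                 * POLLIN/POLLOUT-style events; read/write until EAGAIN. */
        }
}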

Or does this method still have the overhead of encapsulating the events
into signals within the kernel?

Also, what is different in your above interface that prevents it from
being able to queue up too many events?  I guess the structure is only
sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
FDs on x86, but I don't see how the other method would be significantly
different.  The kernel would have to store the queued events still,
surely...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Mitchell Blank Jr wrote:
> I think everyone should take a timeout and look at Solaris 8's /dev/poll
> interface.  This discussion is reinventing the wheel, the lever, and the
> inclined plane.
> 
>   http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
> 
> I think this is a lot cleaner than your approach:

You must be kidding.

/dev/poll is the interface from _hell_. It's absolutely disgusting.

>   * it doesn't add extra syscalls

Sure it does. 

What do you think ioctl's are? Every single ioctl number is an extra
system call. It's just indirected inside the device driver instead of at
the normal system call entry point. The advantage? It's noticeably slower
than a real system call.

And the final reason why /dev/poll should just go away: not only is the
interface ugly, but it's hard to use too. It's really hard to mix
different users of /dev/poll - you'll have to create a whole new event
structure around it.

Sure, you can have multiple "poll lists" by opening /dev/poll multiple
times, and you seem to see this as a great feature. I'm telling you that
multiple event queues are a horrible disaster, and it's a perfect example
of what an engineer with the "more is better" mindset comes up with.

Multiple event queues are bad, because it completely breaks the notion of
event-driven programming. How do you want to listen to them all? You can't.
You can only listen to one event queue at a time - unless you create some
kind of "hierarchy" of event queues, where you have one event queue that
you wait on that contains the _other_ event queues, so that you can tell
which one has anything pending..

Trust me. Multiple event queues are idiotic. Anybody who _designs_ for
multiple event queues has some serious problems.

Basically, tastes differ. Everything you seem to see as an advantage of
/dev/poll, I see as a horrible kludge.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Dan Kegel wrote:
> Linus Torvalds wrote:
> > Basically, the main loop would boil down to
> > for (;;) {
> > static struct event ev_list[MAXEV];
> > get_event(ev_list, MAXEV, &tmout);
> > .. timeout handling here ..
> > }
> > 
> > because get_even() would end up doing all the user-mode calls too (so
> > "get_event()" is no longer a system call: it's a system call + a for-loop
> > to call all the ID handler functions that were associated with the events
> > that triggered).
> 
> Occurs to me that this stuff will be used from other languages
> than C, which might prefer to do their own event dispatching,
> and would want the system to treat event_fn as another opaque quantity.

Yes and no.

The KERNEL would treat it as just another opaque quantity: after all, it
would never ever touch the thing other than get it as part of
"bind_event()" and return it as part of "get_events()". So as far as the
kernel is concerned, both "event_fn" and "opaque" are just random
information.

However, it is very important to have every common user agree about the
meaning of the ID's in user space: the whole point of encapsulating
"event_fn" as a function pointer is that you can have different parts of
the same program use the "bind_event()" interface, and they won't step on
each others toes.

So if you program in Fortran, for example, and you expect to link against
the C library, and the C library uses events for something, then you'd
better make sure that the fortran interfaces end up using the event_fn
thing in the same way as the C one..

> So I don't object to a function that dispatches events, e.g.
>  int get_event();
> as long as there's also 
>  int get_events(ev_list, MAXEV, &tmout)
> that treats event_fn as an opaque pointer and *does not* dispatch the events.

That's a user-mode issue, and obviously, if you don't care about
compatibility with anything else you can do this. The kernel won't know,
or care. But I think it is unlikely that you'd ever use the raw events in
any other way - most environments under UNIX tend to have a C library
component somewhere, because the low-level stuff is written in C and uses
the common code that way even when the programmer isn't aware of it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds


[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
> 
> Might be good to pick more unique names than 'struct event' and 'get_event'.
> People probably use those already.

I agree. I would never implement them under those names, but it's easier
to talk about "event" than about something very specific.

> Hiding ev_list is probably ok.  However,
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
> suggests that knowing how many events are pending is a useful measure 
> of server load, and that if more than a certain number of events
> are pending, web servers should reject new connections.  Thus it might
> be handy to make the return value of get_event be the number of events
> gotten.

Note that my "get_event()" function actually did return exactly that: it
returned the number of events, even though the actual event array handling
was hidden inside the library function.

> > So now you'd start everything off (assuming the same kind of "listen to
> > everything and react to it" server as in my previous example) by just
> > setting
> > 
> > bind_event(sock, POLLIN, NULL, accept_fn);
> 
> A couple questions:
> 
> * how do you unbind?  (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means
that the event obviously no longer exists, as it can no longer trigger.
The accept_fn would be part of the identifier, and as such zeroing it out
would lose the identification.

But I'd personally probably prefer to be more explicit, and have a
separate function for it. 

Note that whether it's a separate function or just making "bind_event()"
really be "modify_or_bind_or_remove_event()" is pretty much not anything I
care about - it gets to be so low-level details that it would be something
that is probably best tried out and something that I don't have any
religious issues with.

I care about high-level interfaces, because _those_ are the ones that
screw me years down the line. For example, select() (and later bind()) was
always an interface that did _not_ fit into the UNIX way of doing things
AT ALL. It really stands out among all the IO interfaces as being very
different. And it's one of the nastier ones to maintain.

> * how do you change a mask?  (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or
separate call.

> * Is it ok to do these things while in an event_fn?  (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an
event function - the kernel only sees "change the event" and "get the list
of events".

> * Do you get an event whenever an fd is ready for something, or
>   only when its readiness changes?  (Presumably whenever an fd is ready for
>   something?)

Only when its readiness changes - probably with the addition that it would
simplify things that a new event always (unconditionally) gets added to
the event queue.

Note that there are no races in this implementation - there is no way for
events to get lost, even trivially seen in SMP/threaded environments.
That's important. 

The BSD kevent paper goes on about "level and edge triggered" and it
becomes a big thing for them, and they selected level-triggered events as 
if it made any difference. And sure - it _does_ make a difference, but the
only difference is in how hard it is to implement, and level-triggered is
noticeably harder.

The bind_event()/get_events() interface is edge-triggered, and works
fundamentally the same way as SIGCHLD. If you don't keep up, you get
SIGCHLD only once, even if ten children died on you in rapid succession.
And it's edge-triggered: that doesn't mean that you lose events (like the
BSD paper implies), it only means that you have to reap your children with
a loop:

while ((err = waitpid(..)) >= 0) {
.. get all children ..
}
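
The same contract carried over to a non-blocking socket is the familiar
"read until EAGAIN" loop (a sketch, not from the original mail) - the edge
will not be re-reported while data merely sits in the buffer:

#include <errno.h>
#include <unistd.h>

static void drain(int fd)
{
        char buf[4096];

        for (;;) {
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0)
                        continue;       /* ... consume n bytes, keep going  */
                if (n == 0)
                        break;          /* EOF: peer closed, clean up       */
                if (errno == EINTR)
                        continue;
                if (errno == EAGAIN)
                        break;          /* buffer empty: wait for next edge */
                break;                  /* real error: handle/close         */
        }
}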

The reason "edge-triggered" (ie only when an event changes) is preferable
is that it's MUCH simpler, and means that the "get_events()" system call
doesn't need to actually understand what the events mean at all. It will
just take the events off the list. If anything triggers after that, it will
be put back on the list again, but you can guarantee that you don't get
any lost events simply by the simple fact that

 - by the time get_events() returns to user mode (ie before the action
   functions have been called), any new event that happens even before we
   have had to react to the old events will cause the event to be queued
   up again.

What does this mean? It basically means that if you continue to get events
on a file descriptor during your event handler function, the event will
have moved back to the event list, and next time you _will_ get an event
again.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mitchell Blank Jr

Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.

I think everyone should take a timeout and look at Solaris 8's /dev/poll
interface.  This discussion is reinventing the wheel, the lever, and the
inclined plane.

http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123

I think this is a lot cleaner than your approach:
  * it doesn't add extra syscalls

  * it handles multiple event queues, and does so without ugliness.
all the per-queue state is held in the /dev/poll's "struct file"
instance

  * in your method you have a per-process queue - but under what clone()
conditions is it shared among siblings?  Here the user has a choice -
they can share the open("/dev/poll") file descriptor or not using
any of the existing means.  Granted, they also would probably want
to arrange to share the fd's being polled in order to make this
useful.

  * No new fields in task_struct

A few simple improvements can be made to the Sun model, though:

  * The fact that the fd of /dev/poll can't be used for poll(2) or select(2)
is ugly.  Sure, you probably don't want an open instance of /dev/poll
being added to another /dev/poll, but being able to call "select" on
them would be really useful:
  1. Imagine a server that has to process connections from both
 high-priority and low-priority clients - and that requests from
 the high-priority ones always take precedence.  With this
 interface you could easily have two open instances of /dev/poll
 and then call select(2) on them.  This ability just falls
 out naturally from the interface.
  2. Some libraries are internally driven by select(2) loops (I think
 Xlib is like this, IIRC)  If you have a lot of fd's you want to
 watch, this means you must take the hit of calling select(2) on
 all of them.  If you could just pass in a fd for /dev/poll,
 problem solved.

  * I think the fact that you add events via write(2) but retrieve them
via ioctl(2) is an ugly asymmetry.  From what I can see, the entire
motivation for using ioctl as opposed to read(2) is to allow the user
to specify a timeout.  If you could use poll(2)/select(2) on /dev/poll
this issue would be moot (see above)

  * It would be nice if the interface were flexible enough to report
items _other_ than "activity on fd" in the future.  Maybe SYSV IPC?
itimers?  directory update notification?  It seems that every time
UNIX adds a mechanism of waiting for events, we spoil it by not
making it flexible enough to wait on everything you might want.
Lets not repeat the mistake with a new interface.

  * The "struct pollfd" and "struct dvpoll" should also include a 64-bit
opaque quantity supplied by userland which is returned with each event
on that fd.  This would save the program from having to look up
which connection context goes with each fd - the kernel could just
give you the pointer to the structure.  Not having this capability isn't
a burden for programs dealing with a small number of fd's (since they
can just have a lookup table) but if you potentially have thousands of
connections it may be undesirable to allocate an array for them all.

The Linux patch of /dev/poll implements mmap'ing of the in-kernel poll
table... I don't think this is a good idea.  First, the user just wants to
be able to add events and dequeue them - both linear operations.  Second,
why should the kernel be forced to maintain a fixed-size linear list of
events we're looking for... this seems mindlessly inefficient.  Why not
just pull a "struct pollfd" out of slab each time a new event is listened
for?
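
For readers who have not seen it, the usage pattern under discussion looks
roughly like this (a sketch from my reading of the Solaris interface;
DP_POLL, struct dvpoll and POLLREMOVE come from there and may differ in
the Linux patch, and error handling is omitted):

#include <sys/devpoll.h>        /* struct dvpoll, DP_POLL, POLLREMOVE */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#define MAXEV 128

static void poll_loop(const int *fds, int nfds)
{
        int i, dpfd = open("/dev/poll", O_RDWR);
        struct pollfd pfd, results[MAXEV];
        struct dvpoll dvp;

        /* register interest: one pollfd written per descriptor (could batch) */
        for (i = 0; i < nfds; i++) {
                pfd.fd = fds[i];
                pfd.events = POLLIN;
                pfd.revents = 0;
                write(dpfd, &pfd, sizeof(pfd));
        }

        for (;;) {
                dvp.dp_fds = results;
                dvp.dp_nfds = MAXEV;
                dvp.dp_timeout = -1;                    /* block indefinitely */

                int n = ioctl(dpfd, DP_POLL, &dvp);     /* level-triggered */
                for (i = 0; i < n; i++) {
                        /* results[i].fd is ready for results[i].revents */
                }
        }
        /* to drop a descriptor, write a pollfd with events = POLLREMOVE */
}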

My unresolved concerns:
  * Is this substantially better than the already existing rtsig+poll
solution?  Enough to make implementing it worth while?

  * How do we quickly do the "struct file" -> "struct pollfd" conversion
each time an event happens?  Especially if there are multiple /dev/poll
instances open in the current process.  Probably each "struct file"
would need a pointer to the instance of /dev/poll which would have
some b-tree variant (or maybe a hash table).  The limitation would
be that a single fd couldn't be polled for events by two different
/dev/poll instances, even for different events.  This is probably
good for sanity's sake anyway.

-Mitch
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Felix von Leitner

Thus spake Linus Torvalds ([EMAIL PROTECTED]):
> I disagree.

> Let's just face it, poll() is a bad interface scalability-wise.

Is that a reason to implement it badly?

Felix
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Andi Kleen

On Mon, Oct 23, 2000 at 09:06:11PM -0700, Linus Torvalds wrote:
> 
> 
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> > 
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array) 
> 
> That's _not_ how poll() works at all.

But how /dev/poll would work. Sorry for not marking it clearly. 

> 
> We don't _have_ a poll table in the kernel, and no way to mmap it. The
> poll() tables gets created dynamically based on the stuff that the user
> has set up in the table. And the user can obviously change the fd's etc in
> the table directly, so in order for the caching to work you need to do
> various games with page table dirty or writable bits, or at least test
> dynamically whether the poll table is the same as it was before.

Actually assuming the user does not duplicate fds in the poll array it
would always work out the lazy way (and when he duplicates fds I would
say the behaviour is undefined -- no spec says in what order poll
must process fds in the supplied poll table)

You either find the correct fd in the offset you cached, and if not
the user has changed the table and you need to recompute all cached offsets. 

No need to play pte tricks. 

> 
> Sure, it's doable, and apparently Solaris does something like this. But
> what _is_ the overhead of the Solaris code for small number of fd's? I bet
> it really is quite noticeable. I also suspect it is very optimized toward
> an unchanging poll-table.

For a small number of fds you use the fast_poll/fast_select that I implemented
in the patch I sent to you.


> 
> > What is your favourite interface then ? 
> 
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

I think if you declare the case of duplicated fd entries in a poll array
undefined, you don't need the controlled add/remove_entry.
[and from a quick glimpse at Single Unix it does not say anything about
the order of fd processing in poll, so the spec is definitely flexible
enough for that]


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
> Basically, the main loop would boil down to
> for (;;) {
> static struct event ev_list[MAXEV];
> get_event(ev_list, MAXEV, &tmout);
> .. timeout handling here ..
> }
> 
> because get_even() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).

Occurs to me that this stuff will be used from other languages
than C, which might prefer to do their own event dispatching,
and would want the system to treat event_fn as another opaque quantity.

So I don't object to a function that dispatches events, e.g.
 int get_event();
as long as there's also 
 int get_events(ev_list, MAXEV, &tmout)
that treats event_fn as an opaque pointer and *does not* dispatch the events.

These probably both want to be C library functions (rather than
get_events being only a system call) on the chance that some
operating systems will implement Linux emulation at the
shared library level rather than the syscall level.
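
A sketch of that split, with the struct from Linus's expanded proposal
(none of this exists: get_events() is the hypothetical raw system call,
get_event() the library-level dispatcher):

#include <sys/time.h>

struct event {
        int fd;
        unsigned long mask;
        void *opaque;
        void (*event_fn)(int fd, unsigned long mask, void *opaque);
};

/* hypothetical raw call: fills ev_list, never dispatches */
int get_events(struct event *ev_list, int maxnr, struct timeval *tmout);

#define MAXEV 64

/* library-level dispatcher: the system call plus a for-loop */
int get_event(struct timeval *tmout)
{
        static struct event ev_list[MAXEV];
        int i, n = get_events(ev_list, MAXEV, tmout);

        for (i = 0; i < n; i++)
                ev_list[i].event_fn(ev_list[i].fd, ev_list[i].mask,
                                    ev_list[i].opaque);
        return n;       /* handy as a rough load measure, too */
}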

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
> Basically, the main loop would boil down to
> 
> for (;;) {
> static struct event ev_list[MAXEV];
> get_event(ev_list, MAXEV, &tmout);
> .. timeout handling here ..
> }
> 
> because get_even() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).
> 
> So the "struct event" would just be:
> 
> struct event {
> int fd;
> unsigned long mask;
> void *opaque;
> void (*event_fn)(int fd, unsigned long mask, void *opaque);
> }
> 
> and there's no need for separate event queues, because the separate event
> queues have been completely subsumed by the fact that every single event
> has a separate event function.

OK, guess I buy the one-queue-to-bind-them-all argument.

Might be good to pick more unique names than 'struct event' and 'get_event'.
People probably use those already.

Hiding ev_list is probably ok.  However,
http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
suggests that knowing how many events are pending is a useful measure 
of server load, and that if more than a certain number of events
are pending, web servers should reject new connections.  Thus it might
be handy to make the return value of get_event be the number of events
gotten.
 
> So now you'd start everything off (assuming the same kind of "listen to
> everything and react to it" server as in my previous example) by just
> setting
> 
> bind_event(sock, POLLIN, NULL, accept_fn);

A couple questions:

* how do you unbind?  (By calling bind_event with NULL as the accept_fn?)

* how do you change a mask?  (By calling bind_event with a new mask?)

* Is it ok to do these things while in an event_fn?  (Yes?)

* Do you get an event whenever an fd is ready for something, or
  only when its readiness changes?  (Presumably whenever an fd is ready for something?)

> (This is also the ideal event programming interface - signals get racy and
> hard to handle, while in the above example you can trivially just be
> single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
> you want to: you multi-thread things by just having multiple threads that
> all call "get_event()" on their own).

I for one will be glad to not have to think of the races
caused by the delay between signal queueing and signal pickup.

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Lincoln Dale

At 10:39 PM 23/10/2000 -0700, Linus Torvalds wrote:
>First, let's see what is so nice about "select()" and "poll()". They do
>have one _huge_ advantage, which is why you want to fall back on poll()
>once the RT signal interface stops working. What is that?

RT methods are bad if they consume too many resources.  SIGIO is a good 
example of this - the current overhead of passing events to user-space 
incurs both a spinlock and a memory copy of 512 bytes for each 
event.  while it removes the requirement to "walk lists", the signal 
semantics in the kernel and the overhead of memory copies to userspace 
negate its performance a fair bit.

that isn't to say that all "event-driven" methods are bad.  in the past 
year, i've done many experiments at making SIGIO more efficient.

some of these experiments include --
  [1] 'aggregate' events.  that is, if you've registered a POLL_IN, no need
  to register another POLL_IN
  this was marginally successful, but ultimately still didn't scale.

  [2] create a new interface for event delivery.

for [2] i settled on a 16-byte structure sufficient to pass all of the relevant
information:
 typedef struct zerocopy_buf {
 int fd;
 short int   cmd;
 #define ZEROCOPY_VALID_BUFFER   0xe1e2
 short int   valid_buffer;
 void*buf; /* skbuff */
 #ifdef __KERNEL__
 volatile
 #endif
 struct zerocopy_buf *next;
 } zerocopy_buf_t;

so, we get down to 16 bytes per-event.  these are allocated

coupled with this was an interface whereby user-space could view 
kernel-space (via read-only mmap).
in my case, this allowed for user-space to be able to read the above chain 
of zerocopy_buf events with no kernel-to-user memory copies.

an ioctl on a character driver could ask the kernel to give it the head of 
the chain of the current zerocopy_buf structure.  a similar ioctl() call 
allows it to pass a chain of instructions to the kernel (adding/removing 
events from notification) and other housekeeping.

since user-space had read-only visibility into kernel memory address-space, 
one could then pick up skbuff's in userspace without the overhead of copies.

... and so-on.
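(A rough sketch of what the user-space side of such a scheme could look
like.  The device name, ioctl number, mapping size and the kernel-to-user
pointer translation are invented here; only the zerocopy_buf layout comes
from the structure quoted above:)

    int zfd = open("/dev/zerocopy", O_RDWR);            /* invented name */
    char *base = mmap(NULL, ZC_MAP_SIZE, PROT_READ,     /* read-only view of */
                      MAP_SHARED, zfd, 0);              /* the kernel's queue */
    zerocopy_buf_t *ev;

    ioctl(zfd, ZC_GET_HEAD, &ev);                       /* invented ioctl: head
                                                           of the event chain */
    while (ev != NULL) {
            /* a real implementation would translate the kernel pointer
               into an offset within the read-only mapping */
            handle_event(ev->fd, ev->cmd, ev->buf);     /* caller-supplied */
            ev = ev->next;
    }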

the above is a bit of a simplification of what goes on.  using flip-buffers 
of queues, one can use this in multiple processes and be SMP-safe without 
the requirements for spinlocks or semaphores in the "fast path".  solving 
the "walk the list of fd's" and "incur the overhead of memory copies" tied 
in with network hardware capable of handling scatter/gather DMA and IP and 
TCP checksum calculations, i've more than doubled the performance of an 
existing application which depended on poll()-type behaviour.

while i agree that it isn't necessarily a 'generic' interface, and won't 
necessarily appeal to everyone as the cure-all, the techniques used have 
removed two significant bottlenecks to high-network-performance i/o on 
tens-of-thousands of TCP sockets for an application we've been working on.



cheers,

lincoln.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Mon, 23 Oct 2000, Dan Kegel wrote:
> 
> kqueue lets you associate an arbitrary integer with each event
> specification; the integer is returned along with the event.
> This is very handy for, say, passing the 'this' pointer of the
> object that should handle the event.  Yes, you can simulate it
> with an array indexed by fd, but I like having it included.

I agree. I think the ID part is the most important part of the interfaces,
because you definitely want to re-use the same functions over and over
again - and you want to have some way other than just the raw fd to
identify where the actual _buffers_ for that IO is, and what stage of the
state machine we're in for that fd etc etc.

Also, it's potentially important to have different "id"s for even the same
fd - in some cases you want to have the same event handle both the read
and the write part on an fd, but in other cases it might make more
conceptual sense to separate out the read handling from the write
handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have
two separate events, one with POLLIN and one with POLLOUT.
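(Concretely, with the hypothetical bind_event() from this thread, the read
and write sides of one descriptor could be registered as two independent
events, each with its own handler and opaque state:)

    /* same fd, two events, two handlers -- conn is the caller's own state */
    bind_event(fd, POLLIN,  conn, handle_readable);
    bind_event(fd, POLLOUT, conn, handle_writable);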

This was what my "unsigned long id" was, but that is much too hard to use.
See my expanded suggestion of it just a moment ago.

(And yes, I'm sure you can do all this with kevents. I just abhor the
syntax of those things).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Mon, 23 Oct 2000, Dan Kegel wrote:
> 
> 
>http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
> describes the FreeBSD kqueue interface for events:

I've actually read the BSD kevent stuff, and I think it's classic
over-design. It's not easy to see what it's all about, and the whole <kq,
ident, filter> tuple crap is just silly. Looks much too complicated.

I don't believe in the "library" argument at all, and I think multiple
event queues completely detract from the whole point of being simple to
use and implement.

Now, I agree that my bind_event()/get_event() has limitations, and the
biggest one is probably the event "id". It needs to be better, and it
needs to have more structure. The "id" really should be something that not
only contains the "fd", but also contains the actor function to be called,
along with some opaque data for that function.

In fact, if you take my example server, and move the "handle[id]()" call
_into_ get_events() (and make the "handle[id]()" function pointer a part
of the ID of the event), then the library argument goes away too: it
doesn't matter _who_ calls the get_event() function, because the end
result is always going to be the same regardless of whether it is called
from within a library or from a main loop: it's going to call the function
handle associated with the ID that triggered.

Basically, the main loop would boil down to

for (;;) {
static struct event ev_list[MAXEV];
get_event(ev_list, MAXEV, &tmout);
.. timeout handling here ..
}

because get_event() would end up doing all the user-mode calls too (so
"get_event()" is no longer a system call: it's a system call + a for-loop
to call all the ID handler functions that were associated with the events
that triggered).

So the "struct event" would just be:

struct event {
int fd;
unsigned long mask;
void *opaque;
void (*event_fn)(int fd, unsigned long mask, void *opaque);
}

and there's no need for separate event queues, because the separate event
queues have been completely subsumed by the fact that every single event
has a separate event function.

So now you'd start everything off (assuming the same kind of "listen to
everything and react to it" server as in my previous example) by just
setting

bind_event(sock, POLLIN, NULL, accept_fn);

which basically creates the event inside the kernel, and will pass it to
the "__get_event()" system call through the event array, so the
get_event() library function basically looks like

int get_event(struct event *array, int maxevents, struct timeval *tv)
{
int nr = __get_event(array, maxevents, tv);
int i;
for (i = 0; i < nr; i++) {
array->event_fn(array->fd, array->mask, array->opaque);
array++;
}
return nr;
}

and tell me why you'd want to have multiple event queues any more?

(In fact, you might as well move the event array completely inside
"get_event()", because nobody would be supposed to look at the raw array
any more. So the "get_event()" interface would be even simpler: just a
timeout, nothing more. Talk about simple programming.
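(To make the shape of this concrete, here is a toy single-threaded server
against the hypothetical interface sketched in this mail.  The prototypes,
accept_fn/read_fn and MAXEV are assumptions for illustration only, not an
existing API:)

    /* hypothetical interface -- nothing here exists in a real kernel */
    extern int bind_event(int fd, unsigned long mask, void *opaque,
                          void (*event_fn)(int, unsigned long, void *));
    extern int get_event(struct event *array, int maxevents,
                         struct timeval *tv);

    static void read_fn(int fd, unsigned long mask, void *opaque)
    {
            char buf[4096];
            int n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                    close(fd);              /* assume closing drops the binding */
            /* ... otherwise process n bytes ... */
    }

    static void accept_fn(int fd, unsigned long mask, void *opaque)
    {
            int conn = accept(fd, NULL, NULL);
            if (conn >= 0)
                    bind_event(conn, POLLIN, NULL, read_fn);
    }

    void serve(int sock)
    {
            struct event ev_list[MAXEV];
            struct timeval tmout = { 1, 0 };

            bind_event(sock, POLLIN, NULL, accept_fn);
            for (;;) {
                    get_event(ev_list, MAXEV, &tmout);
                    /* .. timeout handling here .. */
            }
    }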

(This is also the ideal event programming interface - signals get racy and
hard to handle, while in the above example you can trivially just be
single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
you want to: you multi-thread things by just having multiple threads that
all call "get_event()" on their own).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:Dave Zarzycki <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> Maybe I'm missing something, but why do you separate out fd from the event
> structure. Why not just "int bind_event(struct event *event)"
> 
> The only thing I might have done differently is allow for multiple event
> queues per process, and the ability to add event queues to event
> queues. But these features might not be all that useful in real life.
> 

This could be useful if it doesn't screw up the implementation too
much.

Pretty much, what Linus is saying is the following:

   select()/poll() have one sizable cost, and that is to set up
   and destroy the set of events we want to trigger on.  We would
   like to amortize this cost by making it persistent.

It would definitely be useful for the user to have more than one such
"event set" installed at any one time, so that you can call different
wait_for_event() [or whatever] as appropriate.  However, if that means
we're doing lots of constructing and deconstructing in kernel space,
then we probably didn't gain much.

The other thing I think we'd really like in a new interface is a way to
explicitly avoid the "storming hordes" problem -- if N servers are waiting
for the same event, it should at least be possible to tell the kernel to
only wake up one (arbitrarily chosen) of them, rather than all.

Finally, it would be somewhat nice to have a unified interface for
synchronous and asynchronous notification.  This should be quite
easily doable by adding a call event_notify(event_set,signal) that
causes real-time signal "signal" to be raised (presumably with the
event_set as the argument), when the specified event_set triggers.
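(i.e. something with roughly this shape -- event_notify() and the 'evset'
handle are purely hypothetical here:)

    /* hypothetical: raise SIGRTMIN+1 whenever 'evset' has pending events */
    event_notify(evset, SIGRTMIN + 1);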

-hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Dan Kegel wrote:
> [kqueue is] Pretty similar to yours, with the following additions:
> 
> Your proposal seems to only have one stream of available events per
> process.  kqueue() returns a handle to an event queue, and kevent()
> takes that handle as a first parameter.
> 
> [kqueue] uses a single call to both bind (or unbind) an array of fd's
> and pick up new events.  Probably reduces overhead.
> 
> [kqueue] allows you to watch not just sockets, but also plain files
> or directories.  (Hey, haven't we heard people talk about letting apps
> do that under Linux?  What interface does that use?)

I forgot to mention:

kqueue lets you associate an arbitrary integer with each event
specification; the integer is returned along with the event.
This is very handy for, say, passing the 'this' pointer of the
object that should handle the event.  Yes, you can simulate it
with an array indexed by fd, but I like having it included.
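(With the real kqueue API that looks roughly like this -- kq, the conn
structure and the handler are the caller's own:)

    struct conn *c = conn_for_fd(fd);               /* caller's own state */
    struct kevent kev, out;

    EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, c); /* udata rides along */
    kevent(kq, &kev, 1, NULL, 0, NULL);             /* register only */

    if (kevent(kq, NULL, 0, &out, 1, NULL) > 0)     /* wait for one event */
            handle_readable(out.udata);             /* the 'this' pointer back */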

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dave Zarzycki

On Mon, 23 Oct 2000, Linus Torvalds wrote:

> where you say "I want an array of pending events, and I have an array you
> can fill with up to 'maxnr' events - and if you have no events for me,
> please sleep until you get one, or until 'tmout'".
> 
> The above looks like a _really_ simple interface to me. Much simpler than
> either select() or poll(). Agreed?

Totally. I've been wanting an API like this in user-space for a long time.

One question about "int bind_event(int fd, struct event *event)"

Maybe I'm missing something, but why do you separate out fd from the event
structure. Why not just "int bind_event(struct event *event)"

The only thing I might have done differently is allow for multiple event
queues per process, and the ability to add event queues to event
queues. But these features might not be all that useful in real life.

Hmmm... It might be cute if you could do something like this:

int num_of_resulting_events;
int r, q = open("/dev/eventq", O_RDWR);
struct event interested_events[1024];
struct event events_that_happened[1024];

/* fill up interested_event_array */

write(q, interested_events, sizeof(interested_events));
r = read(q, events_that_happened, sizeof(events_that_happened));
num_of_resulting_events = r / sizeof(struct event);

davez

-- 
Dave Zarzycki
http://thor.sbay.org/~dave/



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

On Mon, 23 Oct 2000, Linus Torvalds wrote: 
> Here's a suggested "good" interface that would certainly be easy to 
> implement, and very easy to use, with none of the scalability issues that 
> many interfaces have.  ...
> It boils down to one very simple rule: dense arrays of sticky status 
> information is good. So let's design a good interface for a dense array. 
> 
> Basically, the perfect interface for events would be 
> 
> struct event { 
> unsigned long id; /* file descriptor ID the event is on */ 
> unsigned long event; /* bitmask of active events */ 
> }; 
> 
> int get_events(struct event * event_array, int maxnr, struct timeval 
>*tmout); 
>
> int bind_event(int fd, struct event *event); 

http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
describes the FreeBSD kqueue interface for events:

 struct kevent {
 uintptr_t ident;/* identifier for this event */
 short filter;   /* filter for event */
 u_short   flags;/* action flags for kqueue */
 u_int fflags;   /* filter flag value */
 intptr_t  data; /* filter data value */
 void  *udata;   /* opaque user data identifier */
 };

 int kqueue(void);

 int kevent(int kq, const struct kevent *changelist, int nchanges,
 struct kevent *eventlist, int nevents,
 const struct timespec *timeout);

Pretty similar to yours, with the following additions:

Your proposal seems to only have one stream of available events per
process.  kqueue() returns a handle to an event queue, and kevent()
takes that handle as a first parameter.

Their proposal uses a single call to both bind (or unbind) an array of fd's
and pick up new events.  Probably reduces overhead.
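(That single-call batching looks roughly like this -- kq, the two fds and
dispatch() are assumed to exist already:)

    struct kevent changes[2], ready[64];
    struct timespec ts = { 1, 0 };
    int i, n;

    EV_SET(&changes[0], fd1, EVFILT_READ,  EV_ADD,    0, 0, NULL);
    EV_SET(&changes[1], fd2, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);

    /* submit both changes and collect ready events in one system call */
    n = kevent(kq, changes, 2, ready, 64, &ts);
    for (i = 0; i < n; i++)
            dispatch(&ready[i]);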

Their proposal allows you to watch not just sockets, but also plain files
or directories.  (Hey, haven't we heard people talk about letting apps
do that under Linux?  What interface does that use?)

> The really nice part of the above is that it's trivial to implement. It's 
> about 50 lines of code, plus some simple logic to various drivers etc to 
> actually inform about the events. The way to do this simply is to limit it 
> in very clear ways, the most notable one being simply that there is only 
> one event queue per process (or rather, per "struct files_struct" - so 
> threads would automatically share the event queue). This keeps the 
> implementation simple, but it's also what keeps the interfaces simple: no 
> queue ID's to pass around etc. 

I dislike having only one event queue per process.  Makes it hard to use
in a library.  ("But wait, *I* was going to use bind_event()!")

> Advantage: everything is O(1), except for "get_event()" which is O(n) 
> where 'n' is the number of active events that it fills in. 

This is a big win, and is exactly the payoff that Solaris gets with /dev/poll.
 
> Example "server": 

See  http://www.monkeys.com/kqueue/echo.c for a very similar example server
for kqueue() / kevent().

> You get the idea. Very simple, and looks like it should perform quite 
> admirably. With none of the complexities of signal handling, and none of 
> the downsides of select() or poll(). Both of the new system calls would be 
> on the order of 20-30 lines of code (along with the infrastructure to 
> extend the fasync stuff to also be able to handle events) 

Go, Linus, go!  Let's do it!  But don't go *too* simple.  I really would
like a few of the things kqueue has.
 
> Yes, I'm probably missing something. But this is the kind of thing I think 
> we should look at (and notice that you can _emulate_ this with poll(), so 
> you can use this kind of interface even on systems that wouldn't support 
> these kinds of events natively). 

- Dan

p.s. my collection of notes on kqueue is at 
http://www.kegel.com/c10k.html#nb.kqueue
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
 Basically, the main loop would boil down to
 for (;;) {
 static struct event ev_list[MAXEV];
 get_event(ev_list, MAXEV, &tmout);
 .. timeout handling here ..
 }
 
 because get_event() would end up doing all the user-mode calls too (so
 "get_event()" is no longer a system call: it's a system call + a for-loop
 to call all the ID handler functions that were associated with the events
 that triggered).

Occurs to me that this stuff will be used from other languages
than C, which might prefer to do their own event dispatching,
and would want the system to treat event_fn as another opaque quantity.

So I don't object to a function that dispatches events, e.g.
 int get_event(tmout);
as long as there's also 
 int get_events(ev_list, MAXEV, tmout)
that treats event_fn as an opaque pointer and *does not* dispatch the events.
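(In other words, the dispatching form could just be a thin library wrapper
over the raw one -- both names hypothetical, as above:)

    int get_event(struct timeval *tmout)
    {
            struct event ev[MAXEV];
            int i, n = get_events(ev, MAXEV, tmout);    /* raw: no dispatch */

            for (i = 0; i < n; i++)                     /* C-style dispatch */
                    ev[i].event_fn(ev[i].fd, ev[i].mask, ev[i].opaque);
            return n;
    }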

These probably both want to be C library functions (rather than
get_events being only a system call) on the chance that some
operating systems will implement Linux emulation at the
shared library level rather than the syscall level.

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Andi Kleen

On Mon, Oct 23, 2000 at 09:06:11PM -0700, Linus Torvalds wrote:
 
 
 On Tue, 24 Oct 2000, Andi Kleen wrote:
  
  I don't see the problem. You have the poll table allocated in the kernel,
  the drivers directly change it and the user mmaps it (I was not proposing
  to let poll make a kiobuf out of the passed array) 
 
 That's _not_ how poll() works at all.

But how /dev/poll would work. Sorry for not marking it clearly. 

 
 We don't _have_ a poll table in the kernel, and no way to mmap it. The
 poll() tables gets created dynamically based on the stuff that the user
 has set up in the table. And the user can obviously change the fd's etc in
 the table directly, so in order for the caching to work you need to do
 various games with page table dirty or writable bits, or at least test
 dynamically whether the poll table is the same as it was before.

Actually assuming the user does not duplicate fds in the poll array it
would always work out the lazy way (and when he duplicates fds I would
say the behaviour is undefined -- no spec says in what order poll
must process fds in the supplied poll table)

You either find the correct fd in the offset you cached, and if not
the user has changed the table and you need to recompute all cached offsets. 

No need to play pte tricks. 
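(A sketch of that lazy check -- the names are invented; the idea is only
that a cached offset stays valid as long as the fd found there still
matches:)

    struct poll_cache {
            int fd;         /* fd we expect at this offset */
            int off;        /* cached offset into the user's table */
    };

    /* returns the offset if the cache is still valid, -1 if the user
       edited the table and every cached offset must be recomputed */
    static int cached_off(struct pollfd *table, int nfds, struct poll_cache *c)
    {
            if (c->off < nfds && table[c->off].fd == c->fd)
                    return c->off;
            return -1;
    }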

 
 Sure, it's doable, and apparently Solaris does something like this. But
 what _is_ the overhead of the Solaris code for small number of fd's? I bet
 it really is quite noticeable. I also suspect it is very optimized toward
 an unchanging poll-table.

For small number of fds you use the fast_poll/fast_select that I implemented 
in the patch I sent to you.


 
  What is your favourite interface then ? 
 
 I suspect a good interface that can easily be done efficiently would
 basically be something where the user _does_ do the equivalent of a
 read-only mmap() of poll entries - and explicit and controlled
 "add_entry()" and "remove_entry()" controls, so that the kernel can
 maintain the cache without playing tricks.

I think if you declare the case of duplicated fd entries in a poll array
as undefined you don't need the controlled add/remove_entry.
[and from a quick glimpse at Single Unix it does not say anything about
the order of fd processing in poll, so the spec is definitely flexible
enough for that]


-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Felix von Leitner

Thus spake Linus Torvalds ([EMAIL PROTECTED]):
 I disagree.

 Let's just face it, poll() is a bad interface scalability-wise.

Is that a reason to implement it badly?

Felix
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mitchell Blank Jr

Linus Torvalds wrote:
 Here's a suggested "good" interface that would certainly be easy to
 implement, and very easy to use, with none of the scalability issues that
 many interfaces have.

I think everyone should take a timeout and look at Solaris 8's /dev/poll
interface.  This discussion is reinventing the wheel, the lever, and the
inclined plane.

http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
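(For reference, the usage pattern being discussed is roughly the following,
per the Solaris 8 documentation above -- error handling omitted:)

    #include <sys/devpoll.h>
    #include <fcntl.h>
    #include <poll.h>

    int dpfd = open("/dev/poll", O_RDWR);

    struct pollfd pfd;
    pfd.fd = sock;                       /* register interest in 'sock' */
    pfd.events = POLLIN;
    pfd.revents = 0;
    write(dpfd, &pfd, sizeof(pfd));

    struct pollfd ready[128];
    struct dvpoll dvp;
    dvp.dp_fds = ready;                  /* where ready fds are returned */
    dvp.dp_nfds = 128;
    dvp.dp_timeout = 1000;               /* milliseconds */
    int n = ioctl(dpfd, DP_POLL, &dvp);  /* n = number of ready fds */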

I think this is a lot cleaner than your approach:
  * it doesn't add extra syscalls

  * it handles multiple event queues, and does so without ugliness.
all the per-queue state is held in the /dev/poll's "struct file"
instance

  * in your method you have a per-process queue - but under what clone()
conditions is it shared among siblings?  Here the user has a choice -
they can share the open("/dev/poll") file descriptor or not using
any of the existing means.  Granted, they also would probably want
to arrange to share the fd's being polled in order to make this
useful.

  * No new fields in task_struct

A few simple improvements can be made to the Sun model, though:

  * The fact that the fd of /dev/poll can't be used for poll(2) or select(2)
is ugly.  Sure, you probably don't want an open instance of /dev/poll
being added to another /dev/poll, but being able to call "select" on
them would be really useful:
  1. Imagine a server that has to process connections from both
 high-priority and low-priority clients - and that requests from
 the high-priority ones always take precedence.  With this
 interface you could easily have two open instances of /dev/poll
 and then call select(2) on them.  This ability just falls
 out naturally from the interface.
  2. Some libraries are internally driven by select(2) loops (I think
 Xlib is like this, IIRC)  If you have a lot of fd's you want to
 watch, this means you must take the hit of calling select(2) on
 all of them.  If you could just pass in a fd for /dev/poll,
 problem solved.

  * I think the fact that you add events via write(2) but retrieve them
via ioctl(2) is an ugly asymmetry.  From what I can see, the entire
motivation for using ioctl as opposed to read(2) is to allow the user
to specify a timeout.  If you could use poll(2)/select(2) on /dev/poll
this issue would be moot (see above)

  * It would be nice if the interface were flexible enough to report
items _other_ than "activity on fd" in the future.  Maybe SYSV IPC?
itimers?  directory update notification?  It seems that every time
UNIX adds a mechanism of waiting for events, we spoil it by not
making it flexible enough to wait on everything you might want.
Lets not repeat the mistake with a new interface.

  * The "struct pollfd" and "struct dvpoll" should also include a 64-bit
opaque quantity supplied by userland which is returned with each event
on that fd.  This would save the program from having to look up
which connection context goes with each fd - the kernel could just
give you the pointer to the structure.  Not having this capability isn't
a burden for programs dealing with a small number of fd's (since they
can just have a lookup table) but if you potentially have 1's of
connections it may be undesirable to allocate an array for them all.

The linux patch of /dev/poll implements mmap'ing of the in-kernel poll
table... I don't think this is a good idea.  First, the user just wants to
be able to add events and dequeue them - both linear operations.  Second,
why should the kernel be forced to maintain a fixed-size linear list of
events we're looking for... this seems mindlessly inefficient.  Why not
just pull a "struct pollfd" out of slab each time a new event is listened
for?

My unresolved concerns:
  * Is this substantially better than the already existing rtsig+poll
solution?  Enough to make implementing it worth while?

  * How do we quickly do the "struct file" -> "struct pollfd" conversion
each time an event happens?  Especially if there are multiple /dev/poll
instances open in the current process.  Probably each "struct file"
would need a pointer to the instance of /dev/poll which would have
some b-tree variant (or maybe a hash table).  The limitation would
be that a single fd couldn't be polled for events by two different
/dev/poll instances, even for different events.  This is probably
good for sanity's sake anyway.

-Mitch
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds


[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
 
 Might be good to pick more unique names than 'struct event' and 'get_event'.
 People probably use those already.

I agree. I would never implement them under those names, but it's easier
to talk about "event" than about something very specific.

 Hiding ev_list is probably ok.  However,
 http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
 suggests that knowing how many events are pending is a useful measure 
 of server load, and that if more than a certain number of events
 are pending, web servers should reject new connections.  Thus it might
 be handy to make the return value of get_event be the number of events
 gotten.

Note that my "get_event()" function actually did return exactly that: it
returned the number of events, even though the actual event array handling
was hidden inside the library function.

  So now you'd start everything off (assuming the same kind of "listen to
  everything and react to it" server as in my previous example) by just
  setting
  
  bind_event(sock, POLLIN, NULL, accept_fn);
 
 A couple questions:
 
 * how do you unbind?  (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means
that the event obviously no longer exists, as it can no longer trigger.
The accept_fn would be part of the identifier, and as such zeroing it out
would lose the identification.

But I'd personally probably prefer to be more explicit, and have a
separate function for it. 

Note that whether it's a separate function or just making "bind_event()"
really be "modify_or_bind_or_remove_event()" is pretty much not anything I
care about - it gets to be so low-level details that it would be something
that is probably best tried out and something that I don't have any
religious issues with.

I care about high-level interfaces, because _those_ are the ones that
screw me years down the line. For example, select() (and later bind()) was
always an interface that did _not_ fit into the UNIX way of doing things
AT ALL. It really stands out among all the IO interfaces as being very
different. And it's one of the nastier ones to maintain.

 * how do you change a mask?  (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or
separate call.

 * Is it ok to do these things while in an event_fn?  (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an
event function - the kernel only sees "change the event" and "get the list
of events".

 * Do you get an event whenever an fd is ready for something, or
   only when its readiness changes?  (Presumably whenever an fd is ready for 
something?)

Only when its readiness changes - probably with the addition that it would
simplify things that a new event always (unconditionally) gets added to
the event queue.

Note that there are no races in this implementation - there is no way for
events to get lost, even trivially seen in SMP/threaded environments.
That's important. 

The BSD kevent paper goes on about "level and edge triggered" and it
becomes a big thing for them, and they selected level-triggered events as 
if it made any difference. And sure - it _does_ make a difference, but the
only difference is in how hard it is to implement, and level-triggered is
noticeably harder.

The bind_event()/get_events() interface is edge-triggered, and works
fundamentally the same way as SIGCHLD. If you don't keep up, you get
SIGCHLD only once, even if ten children died on you in rapid succession.
And it's edge-triggered: that doesn't mean that you lose events (like the
BSD paper implies), it only means that you have to reap your children with
a loop:

while ((err = waitpid(..)) >= 0) {
.. get all children ..
}

The reason "edge-triggered" (ie only when an event changes) is preferable
is that it's MUCH simpler, and means that the "get_events()" system call
doesn't need to actually understand what the events mean at all. It will
just take the events off the list. If anything triggers after that, it will
be put back on the list again, but you can guarantee that you don't get
any lost events simply by the simple fact that

 - by the time get_events() returns to user mode (ie before the action
   functions have been called), any new event that happens even before we
   have had to react to the old events will cause the event to be queued
   up again.

What does this mean? It basically means that if you continue to get events
on a file descriptor during your event handler function, the event will
have moved back to the event list, and next time you _will_ get an event
again.
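(In practice that means an edge-triggered read handler drains the
descriptor the same way the waitpid() loop above reaps children -- a
sketch, assuming the fd is non-blocking and consume() is caller-supplied:)

    for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0) {
                    consume(buf, n);
                    continue;
            }
            if (n < 0 && errno == EAGAIN)
                    break;                  /* drained: wait for the next event */
            handle_eof_or_error(fd, n);     /* 0 = EOF, <0 = real error */
            break;
    }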

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Dan Kegel wrote:
 Linus Torvalds wrote:
  Basically, the main loop would boil down to
  for (;;) {
  static struct event ev_list[MAXEV];
  get_event(ev_list, MAXEV, &tmout);
  .. timeout handling here ..
  }
  
  because get_event() would end up doing all the user-mode calls too (so
  "get_event()" is no longer a system call: it's a system call + a for-loop
  to call all the ID handler functions that were associated with the events
  that triggered).
 
 Occurs to me that this stuff will be used from other languages
 than C, which might prefer to do their own event dispatching,
 and would want the system to treat event_fn as another opaque quantity.

Yes and no.

The KERNEL would treat it as just another opaque quantity: after all, it
would never ever touch the thing other than get it as part of
"bind_event()" and return it as part of "get_events()". So as far as the
kernel is concerned, both "event_fn" and "opaque" are just random
information.

However, it is very important to have every common user agree about the
meaning of the ID's in user space: the whole point of encapsulating
"event_fn" as a function pointer is that you can have different parts of
the same program use the "bind_event()" interface, and they won't step on
each others toes.

So if you program in Fortran, for example, and you expect to link against
the C library, and the C library uses events for something, then you'd
better make sure that the fortran interfaces end up using the event_fn
thing in the same way as the C one..

 So I don't object to a function that dispatches events, e.g.
  int get_event(tmout);
 as long as there's also 
  int get_events(ev_list, MAXEV, tmout)
 that treats event_fn as an opaque pointer and *does not* dispatch the events.

That's a user-mode issue, and obviously, if you don't care about
compatibility with anything else you can do this. The kernel won't know,
or care. But I think it is unlikely that you'd ever use the raw events in
any other way - most environments under UNIX tend to have a C library
component somewhere, because the low-level stuff is written in C and uses
the common code that way even when the programmer isn't aware of it.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Mitchell Blank Jr wrote:
 I think everyone should take a timeout and look at Solaris 8's /dev/poll
 interface.  This discussion is reinventing the wheel, the lever, and the
 inclined plane.
 
   http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
 
 I think this is a lot cleaner than your approach:

You must be kidding.

/dev/poll is the interface from _hell_. It's absolutely disgusting.

   * it doesn't add extra syscalls

Sure it does. 

What do you think ioctl's are? Every single ioctl number is an extra
system call. It's just indirected inside the device driver instead of at
the normal system call entry point. The advantage? It's noticeably slower
than a real system call.

And the final reason why /dev/poll should just go away: not only is the
interface ugly, but it's hard to use too. It's really hard to mix
different users of /dev/poll - you'll have to create a whole new event
structure around it.

Sure, you can have multiple "poll lists" by opening /dev/poll multiple
times, and you seem to see this as a great feature. I'm telling you that
multiple event queues are a horrible disaster, and it's a perfect example
of what an engineer with the "more is better" mindset comes up with.

Multiple event queues are bad, because it completely breaks the notion of
event-driven programming. How do you want to listen to them all? You can't.
You can only listen to one event queue at a time - unless you create some
kind of "hierarchy" of event queues, where you have one event queue that
you wait on that contains the _other_ event queues, so that you can tell
which one has anything pending..

Trust me. Multiple event queues are idiotic. Anybody who _designs_ for
multiple event queues has some serious problems.

Basically, tastes differ. Everything you seem to see as an advantage of
/dev/poll, I see as a horrible kludge.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Mon, Oct 23, 2000 at 10:39:36PM -0700, Linus Torvalds wrote:

 Actually, forget the mmap, it's not needed.
 
 Here's a suggested "good" interface that would certainly be easy to
 implement, and very easy to use, with none of the scalability issues that
 many interfaces have.
...
 Basically, the perfect interface for events would be
 
   struct event {
   unsigned long id;   /* file descriptor ID the event is on */
   unsigned long event;/* bitmask of active events */
   };
 
   int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

I like. :)

However, isn't there already something like this, albeit maybe without
the ability to return multiple events at a time?  When discussing
select/poll on IRC a while ago with sct, sct said:

 <sct> Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
   SIGIO as usual.
 <sct> Simon: Then fcntl(fd, F_SETSIG, rtsignum)
 <sct> Simon: And you'll get a signal queue which passes you the fd of
   each SIGIO in turn.
 <Simon> sct: easy :)
 <sct> Simon: You don't even need the overhead of a signal handler:
   instead of select(), you just do "sigwaitinfo(siginfo, timeout)"
   and it will do a select-style IO wait, returning the fd in the
   siginfo when it's available.

(Captured from IRC on Nov 12th, 1998.)
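(What sct describes boils down to roughly this -- a sketch with error
handling omitted; F_SETSIG is Linux-specific, and handle_ready_fd() is
caller-supplied:)

    sigset_t set;
    siginfo_t si;
    int rtsig = SIGRTMIN + 1;
    int fl;

    sigemptyset(&set);
    sigaddset(&set, rtsig);
    sigprocmask(SIG_BLOCK, &set, NULL);      /* queue instead of delivering */

    fcntl(sock, F_SETOWN, getpid());
    fcntl(sock, F_SETSIG, rtsig);
    fl = fcntl(sock, F_GETFL);
    fcntl(sock, F_SETFL, fl | O_NONBLOCK | O_ASYNC);

    for (;;) {
            if (sigwaitinfo(&set, &si) < 0)
                    continue;
            handle_ready_fd(si.si_fd);       /* fd that became ready */
    }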

Or does this method still have the overhead of encapsulating the events
into signals within the kernel?

Also, what is different in your above interface that prevents it from
being able to queue up too many events?  I guess the structure is only
sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
FDs on x86, but I don't see how the other method would be significantly
different.  The kernel would have to store the queued events still,
surely...

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Simon Kirby wrote:
 
 However, isn't there already something like this, albeit maybe without
 the ability to return multiple events at a time?  When discussing
 select/poll on IRC a while ago with sct, sct said:
 
  sct Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
SIGIO as usual.

[ siginfo and RT signals, but using "sigwaitinfo()" instead of actual
  signal handlers ]

The problem with RT signals (whether you use sigwaitinfo() or signal
handling overhead) is that they don't scale to lots of events. The memory
pressure for keeping track of millions of siginfo structures is more than
any system can handle - so all RT signals implementations will have to
eventually fall back on something else.

Which makes for really hard to debug bugs, and complex programming (you
essentially end up having to duplicate all of the event handling, and the
"fallback" case has the added difficulty that you won't even see it except
under heavy load). 

 Also, what is different in your above interface that prevents it from
 being able to queue up too many events?

Basically, with get_events(), there is a maximum of one event per "bind".
And the memory for that is statically allocated at bind_event() time. 

 I guess the structure is only
 sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
 FDs on x86, but I don't see how the other method would be significantly
 different.  The kernel would have to store the queued events still,
 surely...

Oh, the bind information would be more like 32 bytes per bind, so assuming
you have 2 fd's you're waiting for, you'd certainly be using a lot of
memory. Still less than the quite complex SIGIO structures, though.

But you'd be doing so in a controlled manner: the memory use wouldn't go
up just because there is a sudden influx of 5 packets. So it scales
with load by virtue of simply not _caring_ about the load - it only cares
about the number of fd's you're waiting on.

(Obviously _other_ parts of the system will have to keep up with tons of
new packets, but that's why protocols like TCP have flow control, and
that is something that is done independently of the event notification,
so regardless of _what_ kind of event model you have you have to handle
the simple load of TCP ;).

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Abramo Bagnara

Linus Torvalds wrote:
 
 
 struct event {
 int fd;
 unsigned long mask;
 void *opaque;
 void (*event_fn)(int fd, unsigned long mask, void *opaque);

My experience says that:

unsigned long rmask;
void (*event_fn)(struct event *event);

is a far better solution (more type safe, more descriptive).
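
(Concretely - purely as a sketch of the two shapes being compared, not code
from any tree; "event2" is just a name to keep the variants apart:)

/* As proposed earlier in the thread: the callback gets the fields spelled
 * out again as separate arguments. */
struct event {
    int fd;
    unsigned long mask;
    void *opaque;
    void (*event_fn)(int fd, unsigned long mask, void *opaque);
};

/* The variant suggested here: the callback just receives the event itself,
 * plus a separate "returned events" mask, so fields can be added later
 * without touching every callback's signature. */
struct event2 {
    int fd;
    unsigned long mask;         /* events we asked for     */
    unsigned long rmask;        /* events actually pending */
    void *opaque;
    void (*event_fn)(struct event2 *event);
};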


-- 
Abramo Bagnara   mailto:[EMAIL PROTECTED]

Opera Unica
Via Emilia Interna, 140  Phone: +39.0546.656023
48014 Castel Bolognese (RA) - Italy  Fax:   +39.0546.656023

ALSA project ishttp://www.alsa-project.org
sponsored by SuSE Linuxhttp://www.suse.com

It sounds good!
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Simon Kirby

On Tue, Oct 24, 2000 at 10:03:04AM -0700, Linus Torvalds wrote:

 Basically, with get_events(), there is a maximum of one event per "bind".
 And the memory for that is statically allocated at bind_event() time. 
... 
 But you'd be doing so in a controlled manner: the memory use wouldn't go
 up just because there is a sudden influx of 5 packets. So it scales
 with load by virtue of simply not _caring_ about the load - it only cares
 about the number of fd's you're waiting on.

Nice.  I like this.

It would be easy for existing userspace code to start using this
interface as it has the same semantics as select/poll from the event
perspective.  But it would make things even easier, as the bind would
follow the life of the descriptor and thus wouldn't need to be "requeued"
before every get_events call, so that part of userspace code could just
be ripped out^W^W disabled and kept only for portability.

In most of the daemons I have written, I've ended up using memcpy() to
keep a non-scribbled-over copy of the fdsets around so I don't have to
walk data structures and requeue fds on every loop for select()...nasty.
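
(For anyone unfamiliar with that idiom, a rough sketch: select() scribbles on
the sets it is handed, so the long-lived interest set gets copied into a
scratch set on every iteration.  handle_fd() is a hypothetical per-fd handler:)

#include <string.h>
#include <sys/select.h>

extern void handle_fd(int fd);          /* hypothetical per-fd handler */

/* 'master' is the long-lived interest set; 'working' is the scratch copy
 * that select() is allowed to overwrite on return. */
static void select_loop(const fd_set *master, int maxfd)
{
    fd_set working;
    int n, fd;

    for (;;) {
        memcpy(&working, master, sizeof(working));
        n = select(maxfd + 1, &working, NULL, NULL, NULL);
        if (n <= 0)
            continue;
        for (fd = 0; fd <= maxfd; fd++)
            if (FD_ISSET(fd, &working))
                handle_fd(fd);
    }
}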

Simon-

[  Stormix Technologies Inc.  ][  NetNation Communications Inc. ]
[   [EMAIL PROTECTED]   ][   [EMAIL PROTECTED]]
[ Opinions expressed are not necessarily those of my employers. ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Lee Chin

There is only one thing I don't understand about this... why can't we
re-implement Linux's poll() implementation instead of introducing
another system call?

If I understood Linus correctly, what he is saying is that the bind_event
system call is needed to give the kernel a hint that the application is
interested in a certain event associated with a file descriptor.

If the kernel kept such an event queue per process anyway (as soon as the
process opened the file/socket)... then the poll implementation would be
exactly like the get_events system call.  What is wrong with this?

Thanks
Lee

On Mon, 23 Oct 2000, Linus Torvalds wrote:

  What is your favourite interface then ?

 I suspect a good interface that can easily be done efficiently would
 basically be something where the user _does_ do the equivalent of a
 read-only mmap() of poll entries - and explicit and controlled
 "add_entry()" and "remove_entry()" controls, so that the kernel can
 maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queueing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown scenario where your queues get longer and longer.

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more
expensive as you have more and more events - their expense is the number
of file descriptors, not the number of events per se. In fact, both poll()
and select() tend to perform _better_ when you have pending events, as
they are both amenable to optimizations when there is no need for waiting,
and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signals kind of approach is really good for the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information is good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

struct event {
unsigned long id; /* file descriptor ID the event is on */
unsigned long event; /* bitmask of active events */
};

int get_events(struct event * event_array, int maxnr, struct timeval
*tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic to various drivers etc to
actually inform about the events. The way to do this simply is to limit it
in very clear ways, the most notable one being simply that there is only
one event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

- a queue head in "struct files_struct", initially empty.

- doing a "bind_event()" basically adds a fasync entry to the file
structure, but rather than cause a signal, it just looks at 

Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
  * Do you get an event whenever an fd is ready for something, or
only when its readiness changes?  (Presumably whenever an fd is ready for 
something?)
 
 Only when its readiness changes - probably with the addition that it would
 simplify things that a new event always (unconditionally) gets added to
 the event queue.

 Note that there are no races in this implementation - there is no way for
 events to get lost, even trivially seen in SMP/threaded environments.
 That's important.

But user code currently written for poll() has the luxury of dropping
events because poll() will happily report on the current readiness of
the socket every time.  /dev/poll is level-triggered because it's trying
to make conversion of poll()-based code easy.  With your scheme, 
whatever user code is receiving the edges better darn well do something
about them, because it's only going to get them once.
 
 The BSD kevent paper goes on about "level and edge triggered" and it
 becomes a big thing for them, and they selected level-triggered events as
 if it made any difference. And sure - it _does_ make a difference, but the
 only difference is in how hard it is to implement, and level-triggered is
 noticeably harder.

I don't see why edge triggered is that much harder.  All it adds is
a layer which receives the edges and moves fds back and forth between
a 'ready' list and a 'not ready' list.  Easy as pie.
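
(A rough userspace sketch of such a filter, built on the hypothetical
edge-triggered get_events() interface from elsewhere in this thread - all
structure names here are made up:)

#include <poll.h>
#include <sys/time.h>

/* Fold edge notifications from the (hypothetical) get_events() into per-fd
 * state, and keep every fd with non-zero state on a 'ready' list that the
 * application can scan level-style at its leisure. */

struct event {                      /* the proposed interface */
    unsigned long id;
    unsigned long event;
};
extern int get_events(struct event *ev, int maxnr, struct timeval *tmout);

struct fd_state {
    unsigned long level;            /* current readiness bits       */
    struct fd_state *next, *prev;   /* membership in the ready list */
};

static struct fd_state fds[65536];
static struct fd_state ready = { 0, &ready, &ready };   /* empty list head */

static void pump_events(void)
{
    struct event ev[64];
    int i, n = get_events(ev, 64, NULL);

    for (i = 0; i < n; i++) {
        struct fd_state *f = &fds[ev[i].id];
        if (!f->level) {            /* newly ready: link it in */
            f->next = ready.next;
            f->prev = &ready;
            ready.next->prev = f;
            ready.next = f;
        }
        f->level |= ev[i].event;    /* edge -> level */
    }
}

/* When the application drains an fd (say, read() returns EAGAIN) it clears
 * the bits and, if nothing is left, unlinks the entry:
 *
 *     f->level &= ~POLLIN;
 *     if (!f->level) {
 *         f->prev->next = f->next;
 *         f->next->prev = f->prev;
 *     }
 */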

 The reason "edge-triggered" (ie only when an event changes) is preferable
 is that it's MUCH simpler, and means that the "get_events()" system call
 doesn't need to actually understand what the events mean at all. 

Not much understanding is required on the part of the edge-to-level filter.

That said, the edge-to-level filter can be in userspace, and the kernel
part can be just edge-triggered -- as long as the kernel delivers the
downward edges as well as the upward edges.   We could do that by
adding NOTPOLLIN, NOTPOLLOUT, etc., i.e. events that are delivered
when a socket becomes UN-ready for reading.
- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Dan Kegel wrote:
 
 But user code currently written for poll() has the luxury of dropping
 events because poll() will happily report on the current readiness of
 the socket every time.  /dev/poll is level-triggered because it's trying
 to make conversion of poll()-based code easy.  With your scheme, 
 whatever user code is receiving the edges better darn well do something
 about them, because it's only going to get them once.

Oh, I agree. I'm not saying that my approach magically fixes bugs in user
space ;)

  The BSD kevent paper goes on about "level and edge triggered" and it
  becomes a big thing for them, and they selected level-triggered events as
  if it made any difference. And sure - it _does_ make a difference, but the
  only difference is in how hard it is to implement, and level-triggered is
  noticeably harder.
 
 I don't see why edge triggered is that much harder.  All it adds is
   level
 a layer which receives the edges and moves fds back and forth between
 a 'ready' list and a 'not ready' list.  Easy as pie.

Not true.

For example, if you're truly level-triggered, and you have a socket that
gets data, the event moves onto the event queue. So far so fine: both edge
and level agree about this one.

The point they disagree is when the event gets removed from the event
queue. For edge triggered, this one is trivial: when a get_events() thing
happens and moves it into user land. This is basically a one-liner, and it
is local to get_events() and needs absolutely no help from anybody else.
So obviously event removal is _very_ simple for edge-triggered events -
the INTACK basically removes the event (and also re-arms the trigger
logic: which is different from most interrupt controllers, so the analogy 
falls down here).

For level, the thing is not as easy at ALL. Suddenly removal becomes a big
issue, and needs help from the actual driver. You can do it two ways:
calling down to the driver when you remove (to see if the event should be
dismissed or not once it has been read) or have the driver pro-actively
remove the event whenever a read() happens (or whatever it is that undoes the
event).

Both are actually fairly hard. Much harder than they sound. For different
reasons.

 - the callback approach at get_events() time sounds trivial, but actually
   has two problems: cache footprint for "get_events()" grows a _lot_
   (because the events are likely to be spread out a lot if there are a
   lot of them pending, so you don't get a nice tight inner loop at all),
   and you get "double events" - by the time the event first happens, it
   will still be active, so we cannot actually remove it at that time
   (there is still data to be read - and the event doesn't go away until
   we read it) so we'll get the event _again_, and on the next
   get_events() it will notice that the event was bogus, and remove it
   (and we can optimize it away from reporting it to user land at that
   point, so only the kernel needs to look at it twice and do two
   callbacks)

 - the "proactively remove events when the thing that triggerred them goes
   away" approach means that each anti-event (like a read that empties the
   buffer) needs to undo its events, but it also needs to be careful that
   it doesn't undo combined events, and it needs to be very careful about
   races (new packet coming in), so the proactive remove actually ends up
   being less than trivial - and in a performance-critical section.

Now, compare that to a one-liner that just does a "list_del(event->list)"
as it copies over the event to user mode. Wouldn't you say that the
edge-triggered version is simpler?

  The reason "edge-triggered" (ie only when an event changes) is preferable
  is that it's MUCH simpler, and means that the "get_events()" system call
  doesn't need to actually understand what the events mean at all. 
 
 Not much understanding is required on the part of the edge-to-level filter.

Implement it, and see. I bet you'll find that it gets really interesting
when you have concurrent accesses to the same file descriptor etc.

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Abramo Bagnara wrote:

 Linus Torvalds wrote:
  
  
  struct event {
  int fd;
  unsigned long mask;
  void *opaque;
  void (*event_fn)(int fd, unsigned long mask, void *opaque);
 
 My experience say that:
 
   unsigned long rmask;
   void (*event_fn)(struct event *event);
 
 is a far better solution (more type safe, more descriptive).

You're probably right. It certainly makes it easier to add fields later on
if we'd want to extend the ID part without having to change old users..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mitchell Blank Jr

Linus Torvalds wrote:
* it doesn't add extra syscalls
 
 Sure it does. 
 
 What do you think ioctl's are?

As I explained a few lines down from where you stopped quoting (and probably
stopped reading) the ioctl() use is just an artifact of Solaris's icky
implementation.  It could and should use read().  Not an extra syscall.

 Sure, you can have multiple "poll lists" by opening /dev/poll multiple
 times, and you seem to see this as a great feature. I'm telling you that
 multiple event queues are a horrible disaster, and it's a perfect example
 of what an engineer with the "more is better" mindset comes up with.

It has nothing to do with how exciting of a feature it is.  It's a matter of
"hey, if you implement it with /dev/poll, you get multiple queues FOR
FREE".  Those sorts of things are an indication that you've got the
implementation right.  The fact that in your scheme multiple queues would
be extra work is just residue from the fact that you've got the idea wrong.

And how do you propose dealing with multi-threaded programs?  Obviously for
SMP you'd want multiple threads to be able to grab events, so I guess the
"one great queue" will be shared among all the threads.  But, suppose I
want to dedicate a thread to coping with a specific set of events.  Hmm...
guess you'll need a CLONE_EVENTQUEUE flag to support that.  With /dev/poll
none of this foolishness is needed - you just open two of them and let
the threads do what they want.

 Multiple event queues are bad, because it completely breaks the notion of
 event-driven programming. How do you want to listen to them all? You can't.

Explained in detail in my mail.

 You can only listen to one event queue at a time - unless you create some
 kind of "hierarchy" of event queues, where you have one event queue that
 you wait on that contains the _other_ event queues, so that you can tell
 which one has anything pending..

No, event queues of event queues are messy in implementation.  There's
nothing fundamentally wrong with them, but you need to make sure the
graph stays acyclic which is just an implementation hassle.  Not enough
gain for too much code (unless someone has a beautiful algorithm)

However, it's perfectly reasonable to use select() or poll() on two event
queues.  As I explained in my email, this is needed if you're already using
a library that does its own event queue based on poll or select.

 Basically, tastes differ. Everything you seem to see as an advantage of
 /dev/poll, I see as a horrible kludge.

Well, everything I see indicates that /dev/poll would be the easier of
the two interfaces to actually implement.  All the added flexibility is
just gravy.

-Mitch
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Miquel van Smoorenburg

In article [EMAIL PROTECTED],
Linus Torvalds  [EMAIL PROTECTED] wrote:
   struct event {
   unsigned long id;   /* file descriptor ID the event is on */
   unsigned long event;/* bitmask of active events */
   };
   int bind_event(int fd, struct event *event);

Shouldn't there also be a way to add non-filedescriptor based events
into this, such as "child exited" or "signal caught" or shm things?

Or should there simply be ways to make those events available
through filedescriptors anyway? Then you'd get /dev/events - open it, tell
it what kind of event you want to bind to the fd, pass it to bind_event,
read a struct event_info from it when there's something to read.

Just a random thought. Ignore if not appropriate.

Mike.
-- 
Q: How many ears do Trekkies have?
A: Three; the left ear, the right ear, and the final front ear.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Maas

 Shouldn't there also be a way to add non-filedescriptor based events
 into this, such as "child exited" or "signal caught" or shm things?

Waiting on pthreads condition variables, POSIX message queues, and
semaphores (as well as fd's) at the same time would *rock*...

Unifying all these "waitable objects" would be tremendously helpful to fully
exploit the "library transparency" advantage that Linus brought up. Some
libraries might want to wait on things that are not file descriptors...

Regards,
Dan

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Evan Jeffrey


 Multiple event queues are bad, because it completely breaks the notion of
 event-driven programming. How do you want to listen to them all? You can't.
 You can only listen to one event queue at a time - unless you create some

You can listen to one event queue per thread.  Maybe in the case people
always bring up (Apache handling 10,000 concurrent connections), threads
are indistinguishable, but for many applications different threads or thread
groups will be assigned to different tasks.   An ORB ought to be allowed
to clone off an event handler thread that doesn't have to worry about
receiving X events, which might be really slow to handle.  (Or vice versa,
handling a mouse_move event shouldn't have to wait for a CORBA upcall that
recalculates a spreadsheet).

Also, what happens if someone does a clone with CLONE_FILES but no CLONE_VM?
Now the event handlers and opaque data point to things that may not even be
valid in this VM.

I see a couple of ways to do this that don't give up much from the "One
event handler list per process" idea.  Add a CLONE_EVENTS flag to clone.  This
is rather nasty, since it violates orthogonality (since a struct event
references both VM and fd objects), and makes it difficult for pthreads users
to choose whether to share event handlers.  The second is to tag events
with a threadid, and allow add_event to specify a flag saying "only deliver
to this thread" and get_events to say "wait for this threads events, or any
threads events".  The third is to just declare "event handlers are shared
by any processes sharing both a VM and FILES".  This seems rather silly
(since it would still have to be independant in the task_struct, but with
no user space control over it), and doensn't help the poor SOB who has
to handle 

 kind of "hierarchy" of event queues, where you have one event queue that
 you wait on that contains the _other_ event queues, so that you can tell
 which one has anything pending..

Agreed.  The only reason to do this is if your event queue sucks as badly
as poll() and you have to do so for efficiency reasons... (Like people do
with 10 threads each handling 1/10 of the fds).

Evan
---
Fear is the mind killer.   -- Frank Herbert, "Dune"
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Linus Torvalds



On Tue, 24 Oct 2000, Evan Jeffrey wrote:
 
  Multiple event queues are bad, because it completely breaks the notion of
  event-driven programming. How do you want to listen to them all? You can't.
  You can only listen to one event queue at a time - unless you create some
 
 You can listen to one event queue per thread.

Oh, I agree.

And I think something like CLONE_EVENTS would be fine - and decide
yourself what kind of threads you want (do you want indistinguishable
"anonymous" threads like apache, or do you want a threads that handle
separate event queues). Or you might have a mixture of the two - for web
serving the "accept()" event list might be a separate thing with a few
threads doing that, while "worker threads" handle the actual IO..

But if you want to have some event queue ID, I just wonder how you'd set
it up sanely without tons of complications..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Dan Kegel

Linus Torvalds wrote:
  But user code currently written for poll() has the luxury of dropping
  events because poll() will happily report on the current readiness of
  the socket every time.  /dev/poll is level-triggered because it's trying
  to make conversion of poll()-based code easy.  With your scheme,
  whatever user code is receiving the edges better darn well do something
  about them, because it's only going to get them once.
 
 Oh, I agree. I'm not saying that my approach magically fixes bugs in user
 space ;)

With poll(), it was *not a bug* for the user code to drop events; with
your proposed interface, it *is a bug* for the user code to drop events.
I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted
incorrectly that your interface "has the same semantics as poll from
the event perspective".

[ Linus explains why the level-triggered style is difficult ]

  - the "proactively remove events when the thing that triggerred them goes
away" approach means that each anti-event (like a read that empties the
buffer) needs to undo its events, but it also needs to be careful that
it doesn't undo combined events, and it needs to be very careful about
races (new packet coming in), so the proactive remove actually ends up
being less than trivial - and in a performance-critical section.

That's the approach I was thinking of.  It should be about as difficult
as continuously maintaining an integer for each fd which matches what poll() 
with a zero timeout on that fd would return.  (The difference is that
as the bits of the integer change, you would move the fd's record in and out of
the ready list.)  How hard is that?  

 Implement it, and see. I bet you'll find that it gets really interesting
 when you have concurrent accesses to the same file descriptor etc.

Sure, but lots of the kernel is really interesting.
- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread H. Peter Anvin

Followup to:  [EMAIL PROTECTED]
By author:Linus Torvalds [EMAIL PROTECTED]
In newsgroup: linux.dev.kernel
 
 Oh, I agree.
 
 And I think something like CLONE_EVENTS would be fine - and decide
 yourself what kind of threads you want (do you want indistinguishable
 "anonymous" threads like apache, or do you want a threads that handle
 separate event queues). Or you might have a mixture of the two - for web
 serving the "accept()" event list might be a separate thing with a few
 threads doing that, while "worker threads" handle the actual IO..
 
 But if you want to have some event queue ID, I just wonder how you'd set
 it up sanely without tons of complications..
 

How about handling the event queue ID as an event mask internally?

-hpa
-- 
[EMAIL PROTECTED] at work, [EMAIL PROTECTED] in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Mark Montague

Linus Torvalds [EMAIL PROTECTED] writes:

   bind_event(sock, POLLIN, NULL, accept_fn);

[...]

 (In fact, you might as well move the event array completely inside
 "get_event()", because nobody would be supposed to look at the raw array
 any more. So the "get_event()" interface would be even simpler: just a
 timeout, nothing more. Talk about simple programming.

It might be good to have bind_event return any previous accept_fn
that's there, in case a library or something wants to chain a handler
rather than clobbering what's there. Or maybe have a way to query,
although this seems fraught with danger if another thread binds
between a query and a bind.

- M

-- 
Mark "Monty" Montague | [EMAIL PROTECTED]  | I don't do Windows(tm)
I'm dubious about any company whose assets can be destroyed by rm -rf
  URL:http://www.gg.caltech.edu/~monty/monty.shtml
 X-PGP-Fingerprint: E4 EA 6D B1 82 46 DB A1  B0 FF 60 B9 F9 5D 5C F7
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Edgar Toernig

Linus Torvalds wrote:
 
 The point they disagree is when the event gets removed from the event
 queue. For edge triggered, this one is trivial: when a get_events() thing
 happens and moves it into user land. This is basically a one-liner, and it
 is local to get_events() and needs absolutely no help from anybody else.
 So obviously event removal is _very_ simple for edge-triggered events -
 the INTACK basically removes the event (and also re-arms the trigger
 logic: which is different from most interrupt controllers, so the analogy
 falls down here).

And IMHO here's a problem.  The events are no longer events.  They are
just hints saying: after the previous get_events() something has happened.
You don't know if you've already handled this event.  There's no
synchronization between what the app does and the triggering of 'hints'.

For example your waitpid-loop: you get the event, start the waitpid-loop.
While processing, another process dies.  You handle it too (still in the
loop).  But a new 'hint' has already been registered.  So on the next
get_event you'll be notified again.  I just hope every event-generator
has a WNOHANG flag...
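
(The usual defensive pattern, sketched: treat the notification purely as a
hint and reap in a WNOHANG loop, so a stale or duplicate hint simply finds
nothing left to do:)

#include <sys/types.h>
#include <sys/wait.h>

/* Called whenever a "child exited" hint arrives.  Because the hint may be
 * stale or duplicated, reap whatever is currently reapable and stop when
 * there is nothing left, rather than trusting the hint count. */
static void reap_children(void)
{
    int status;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        /* record exit status for 'pid' here */
    }
    /* pid == 0: children remain but none have exited yet.
     * pid  < 0: no children left (ECHILD).  Either way we're done. */
}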

It could even be possible that you are unable to perform some actions
without triggering hints, despite the fact that the conditions will
already be gone before the next get_event.  This may generate a lot of bogus
hints.

At least the current semantic of for example "POLL_IN on fd was signaled
so I may read without blocking" gets lost.

Maybe (don't know kernel wise) it makes sense to check in the kernel
if the events to be returned to userspace are still valid.  The user
space has to do it anyway...  But that way you get a more level-based
design ;)


Another thing: while toying with cooperative userspace multithreading
I found it much more versatile to have a req_type/req_data tuple in
the request structure (ie READ/fd, TIMEOUT/ms, WAKEUP/handle).

Ciao, ET.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Alexander Viro



On Mon, 23 Oct 2000, Jordan Mendelson wrote:

> What you describe is exactly what the /dev/poll interface patch from the
> Linux scalability project does.
> 
> It creates a special device which you can open up and write
> add/remove/modify entries for the descriptors you wish to be notified of,
> using the standard struct pollfd. Removing entries is done by setting the
> events in a struct written to the device to POLLREMOVE.

And that's an ugly crap.

* you add struct {} into the kernel API. _Always_ a bad
idea.
* you either create yet another example of "every open() gives a
new instance" kind of device or you got to introduce a broker process.
* no easy way to check the current set.
* no fscking way to use that from scripts/etc.

> You can optionally mmap() memory which the notifications are written to.
> Two ioctl() calls are provided for the initial allocation and also to
> force it to check all items in your poll() list.

* useless use of ioctl() award

> Solaris has this same interface minus the mmap()'ed memory.

Oh, yes. Solaris. Great example of good taste...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds



On Mon, 23 Oct 2000, Linus Torvalds wrote:
> 
> > What is your favourite interface then ? 
> 
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed.

Here's a suggested "good" interface that would certainly be easy to
implement, and very easy to use, with none of the scalability issues that
many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do
have one _huge_ advantage, which is why you want to fall back on poll()
once the RT signal interface stops working. What is that?

Basically, RT signals or any kind of event queue has a major fundamental
queuing theory problem: if you have events happening really quickly, the
events pile up, and queuing theory tells you that as you start having
queueing problems, your latency increases, which in turn tends to mean
that later events are even more likely to queue up, and you end up in a
nasty meltdown scenario where your queues get longer and longer.

This is why RT signals suck so badly as a generic interface - clearly we
cannot keep sending RT signals forever, because we'd run out of memory
just keeping the signal queue information around.

Neither poll() nor select() have this problem: they don't get more
expensive as you have more and more events - their expense is the number
of file descriptors, not the number of events per se. In fact, both poll()
and select() tend to perform _better_ when you have pending events, as
they are both amenable to optimizations when there is no need for waiting,
and scanning the arrays can use early-out semantics.

So sticky arrays of events are good, while queues are bad. Let's take that
as one of the fundamentals.

So why do people still like RT signals? They do have one advantage, which
is that you do NOT have that silly array traversal when there is nothing
to do. Basically, the RT signals kind of approach is really good for the
cases where select() and poll() suck: no need to traverse mostly empty and
non-changing arrays all the time.

It boils down to one very simple rule: dense arrays of sticky status
information is good. So let's design a good interface for a dense array.

Basically, the perfect interface for events would be

struct event {
unsigned long id;   /* file descriptor ID the event is on */
unsigned long event;/* bitmask of active events */
};

int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you
can fill with up to 'maxnr' events - and if you have no events for me,
please sleep until you get one, or until 'tmout'".

The above looks like a _really_ simple interface to me. Much simpler than
either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie
the "binding" of events. The most straightforward way to do that is to
just do a simple "bind_event()" system call:

int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file
descriptor "fd". The way to stop being interested in events is to just set
the event bitmask to zero.

Now, the perfect interface would be the above. Nothing more. Nothing
fancy, nothing complicated. Only really simple stuff. Remember the old
rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's
about 50 lines of code, plus some simple logic to various drivers etc to
actually inform about the events. The way to do this simply is to limit it
in very clear ways, the most notable one being simply that there is only
one event queue per process (or rather, per "struct files_struct" - so
threads would automatically share the event queue). This keeps the
implementation simple, but it's also what keeps the interfaces simple: no
queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

 - a queue head in "struct files_struct", initially empty.

 - doing a "bind_event()" basically adds a fasync entry to the file
   structure, but rather than cause a signal, it just looks at whether the
   fasync entry is already linked into the event queue, and if it isn't
   (and the event is one of the ones in the event bitmask), it adds itself
   to the event queue.

 - get_event() just traverses the event queue and fills in the array,
   removing them from the event queue. End of story. If the event queue is
   empty, it trivially sees that in a single line of code (+ timeout
   handling)

Advantage: everything is O(1), except for "get_event()" which is O(n)
where 'n' is the number of active 

Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Jordan Mendelson

Linus Torvalds wrote:
> 
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)

> The problem with poll() as-is is that the user doesn't really tell the
> kernel explicitly when it is changing the table..

What you describe is exactly what the /dev/poll interface patch from the
Linux scalability project does.

It creates a special device which you can open up and write
add/remove/modify entries for the descriptors you wish to be notified of,
using the standard struct pollfd. Removing entries is done by setting the
events in a struct written to the device to POLLREMOVE.
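
(A sketch of the add/remove side as described - POLLREMOVE comes from the
/dev/poll patch rather than from any standard header, and the
result-retrieval side is left out:)

#include <poll.h>
#include <unistd.h>

/* Register interest in 'fd' by writing a struct pollfd to the /dev/poll
 * descriptor, and drop it again by writing the same entry with POLLREMOVE.
 * (POLLREMOVE is defined by the /dev/poll patch's header, not <poll.h>.) */
static void dpoll_add(int dpoll_fd, int fd, short events)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = events;
    pfd.revents = 0;
    write(dpoll_fd, &pfd, sizeof(pfd));
}

static void dpoll_remove(int dpoll_fd, int fd)
{
    struct pollfd pfd;

    pfd.fd = fd;
    pfd.events = POLLREMOVE;    /* patch-specific "remove this entry" marker */
    pfd.revents = 0;
    write(dpoll_fd, &pfd, sizeof(pfd));
}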

You can optionally mmap() memory which the notifications are written to.
Two ioctl() calls are provided for the initial allocation and also to
force it to check all items in your poll() list.

Solaris has this same interface minus the mmap()'ed memory.


Jordan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Linus Torvalds



On Tue, 24 Oct 2000, Andi Kleen wrote:
> 
> I don't see the problem. You have the poll table allocated in the kernel,
> the drivers directly change it and the user mmaps it (I was not proposing
> to let poll make a kiobuf out of the passed array) 

That's _not_ how poll() works at all.

We don't _have_ a poll table in the kernel, and no way to mmap it. The
poll() tables get created dynamically based on the stuff that the user
has set up in the table. And the user can obviously change the fd's etc in
the table directly, so in order for the caching to work you need to do
various games with page table dirty or writable bits, or at least test
dynamically whether the poll table is the same as it was before.

Sure, it's doable, and apparently Solaris does something like this. But
what _is_ the overhead of the Solaris code for a small number of fd's? I bet
it really is quite noticeable. I also suspect it is very optimized toward
an unchanging poll-table.

> What is your favourite interface then ? 

I suspect a good interface that can easily be done efficiently would
basically be something where the user _does_ do the equivalent of a
read-only mmap() of poll entries - and explicit and controlled
"add_entry()" and "remove_entry()" controls, so that the kernel can
maintain the cache without playing tricks.

Basically, something like a user interface to something that looks like
the linux poll_table_page structures, with the difference being that it
doesn't have to be created and torn down all the time because the user
would explicitly ask for "add this fd" and "remove this fd" from the
table.

The problem with poll() as-is is that the user doesn't really tell the
kernel explicitly when it is changing the table..

Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

Nick Piggin ([EMAIL PROTECTED]) wrote:
> > I'm trying to write a server that handles 10,000 clients. On 2.4.x, 
> > the RT signal queue stuff looks like the way to achieve that. 
> 
> I would suggest you try multiple polling threads. Not only will you get 
> better SMP scalability, if you have say 16 threads, each one only has to 
> handle ~ 600 fds. 

Good point.  My code is already able to use multiple network threads,
and I have done what you suggest in the past.

But I'm interested in pushing the state of the art here,
and want to see if Linux can handle it with just a single
network thread.  (My server has enough non-network threads to
keep multiple CPUs busy, don't worry :-)

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Dan Kegel

David Schwartz wrote:
> > I'm trying to write a server that handles 10,000 clients.  On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
> > Unfortunately, when the RT signal queue overflows, the consensus seems
> > to be that you fall back to a big poll().   And even though the RT signal
> > queue [almost] never overflows, it certainly can, and servers have to be
> > able to handle it.
> 
> Don't let that bother you. In the case where you get a hit on a significant
> fraction of the descriptors you are polling on, poll is very efficient. The
> inefficiency comes when you have to wade through 10,000 uninteresting file
> descriptors to find the one interesting one. If the poll set is rich in
> ready descriptors, there is little advantage to signal queues over poll
> itself.
> 
> In fact, if you assume the percentage of ready file descriptors (as opposed
> to the number of file descriptors) is constant, then poll is just as
> scalable (theoretically) as any other method. Under both schemes, with twice
> as many file descriptors you have to do twice as much work.

Yep, I've made similar arguments myself.  It's just that seeing
poll() take 14 milliseconds to return on a 650 MHz system is a little daunting.
I'll report again when I have results for RT signal stuff
and different percentages of idle sockets (probably 0, 1, and 10).

- Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux's implementation of poll() not scalable?

2000-10-23 Thread Jordan Mendelson

Dan Kegel wrote:
> 
> Jordan Mendelson ([EMAIL PROTECTED]) wrote:
> > An implementation of /dev/poll for Linux already exists and has shown to
> > be more scalable than using RT signals under my tests. A patch for 2.2.x
> > and 2.4.x should be available at the Linux Scalability Project @
> > http://www.citi.umich.edu/projects/linux-scalability/ in the patches
> > section.
> 
> If you'll look at the page I linked to in my original post,
>   http://www.kegel.com/dkftpbench/Poller_bench.html
> you'll see that I also benchmarked /dev/poll.

The Linux /dev/poll implementation has a few "non-standard" features
such as the ability to mmap() the poll structure memory to eliminate a
memory copy.

int dpoll_fd;
unsigned char *dpoll;
struct pollfd *mmap_dpoll;

dpoll_fd = open("/dev/poll", O_RDWR, 0);        /* one poll set per open() */
ioctl(dpoll_fd, DP_ALLOC, 1);                   /* preallocate entries */
dpoll = mmap(0, DP_MMAP_SIZE(1), PROT_WRITE|PROT_READ,
             MAP_SHARED, dpoll_fd, 0);          /* shared notification area */

mmap_dpoll = (struct pollfd *)dpoll;            /* pollfd view of the mapping */

Read notifications from this memory, use write() to add/remove entries, and
see if you get any boost in performance. Also, I believe there is a hash table
associated with /dev/poll in the kernel patch which might slow down your
performance tests when it's first growing to resize itself.


Jordan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


