From: Simon Marlow <[EMAIL PROTECTED]>
> Yes, I know: heap overflow is somewhat trickier than stack overflow.  If the
> stack overflows, at least you know that by raising an exception you're going
> to reduce the size of the stack, the same isn't true of heap overflow, so
> you need some kind of "soft limit" after which the exception is thrown.

I don't think the problem can be dealt with by exceptions alone. Unless
every thread receives the exception (which would complicate programs),
the threads that do receive it need a little help from the scheduler so that
they can try to clean up without interference from the remaining threads.

Resource exhaustion is critical information. When the limit is reached, every
thread allocating from the same "resource pool" (i.e., not on another machine)
has to react or be acted on. Any thread that does not receive the exception
has to be terminated (suspension might be enough, though) if it attempts to
claim further resources before the crisis has been resolved.

For illustration, consider several limits:

soft 2: dear software <allocating from pool P>, we are going to run out of
            <resource R> soon. Could you please look into this?

soft 1: to all software <allocating from pool P>: <resource R> is low.
            Do something, or I'll do something.

hard 0: <resource R> unavailable, all further requests for it fail.

hard -1: no further progress possible. clean up system for restart.

For soft 2, it might suffice to throw an exception only to a manager thread
(main by default), but when soft 1 is reached, no thread that is unaware of
the problem should be allowed to contribute to it. It's a soft version of
hard 0, so I would try to suspend any thread (in the current resource pool)
that tries to allocate (the resource in question) without being aware that
the pool is in soft 1 for that resource.
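
To make the four levels concrete, here is a minimal sketch of how they
could be modelled. Everything in it (the names Level, Thresholds, level
and decide, and the idea of measuring the pool in "free units") is my own
illustration under stated assumptions, not an existing RTS interface.

```haskell
-- Hypothetical model of the limit levels; none of these names exist in
-- the runtime system, they only illustrate the scheme described above.
data Level
  = Normal      -- no limit reached
  | Soft2       -- running out soon: only managers are told
  | Soft1       -- resource is low: unaware allocators get suspended
  | Hard0       -- resource unavailable: all further requests fail
  | HardMinus1  -- no further progress possible: clean up for restart
  deriving (Eq, Ord, Show)

-- Assumed thresholds, expressed as free units remaining in the pool.
data Thresholds = Thresholds { soft2At :: Int, soft1At :: Int }

-- Classify a pool. The Bool says whether every thread is stuck, which
-- is the only way this sketch reaches hard -1.
level :: Thresholds -> Bool -> Int -> Level
level t stuck free
  | stuck             = HardMinus1
  | free <= 0         = Hard0
  | free <= soft1At t = Soft1
  | free <= soft2At t = Soft2
  | otherwise         = Normal

data Decision = Grant | Suspend | Fail deriving (Eq, Show)

-- Fate of an allocation request. The Bool says whether the requesting
-- thread has already received the exception for the current shortage.
decide :: Level -> Bool -> Decision
decide lvl aware
  | lvl >= Hard0              = Fail
  | lvl == Soft1 && not aware = Suspend
  | otherwise                 = Grant
```

Note that in this sketch soft 2 changes nothing for ordinary threads; it
only matters to whoever is notified. Soft 1 is where unaware threads stop
being served.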

> And which thread should get the exception?  All of them?  Just the one that
> was allocating at the time?  The main thread?  (this reminds me of one of
> the great open questions in OS design: if you run out of memory, which
> process do you kill?).  Perhaps a thread should have to register in order to
> receive the heapOverflow exception, or maybe you should install a "heap
> overflow handler", like a signal handler.  Comments?

With the explanation given above, I would say: the whole system of threads
should get the exception, but that doesn't necessarily mean that each and
every thread gets it (and with soft limits, you don't need to kill any process,
just change the environment so that problematic processes kill themselves).
Rather, when resource R in pool P is low, the threads can be grouped into

- manager threads:
    receive exception, try to deal with it
- affected non-manager threads:
    don't get exception, try to allocate R in P
- neutral threads:
    don't get exception, don't use R in P

Neutral threads don't interfere and can be kept active, probably even
running (it is tempting to allocate all other resources - such as processor
time - to problem resolution instead of sharing them with neutral threads,
but I don't know whether that would be wise: if my browser runs out of
memory, I don't want the rest of my system to stall until the browser has
solved its problem).

As soon as a neutral thread becomes affected, it should be suspended.

Manager threads continue to run up to hard 0, and if they can make
further progress without allocating R in P, they may run up to hard -1.
If they manage to resolve the crisis, suspended affected threads will
wake up without specific actions, and if all managers fail before
resolving the crisis, the runtime system has to clean up.
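
The grouping and the suspend/wake-up rule can be written down as a small
pure model. This is only a sketch: ThreadInfo, role and schedule are
hypothetical names, and "received the exception" is taken as the sole
criterion for being a manager.

```haskell
data Role = Manager | Affected | Neutral deriving (Eq, Show)

-- Hypothetical per-thread facts the scheduler would need to know.
data ThreadInfo = ThreadInfo
  { gotException :: Bool  -- has received the low-resource exception
  , allocatesR   :: Bool  -- is currently trying to allocate R in P
  }

-- Managers are exactly the threads that received the exception; any
-- other thread is affected only while it tries to allocate R in P.
role :: ThreadInfo -> Role
role t
  | gotException t = Manager
  | allocatesR t   = Affected
  | otherwise      = Neutral

data Action = KeepRunning | SuspendThread deriving (Eq, Show)

-- During a crisis only affected threads are suspended. Once the crisis
-- is over, everything runs again without any specific action.
schedule :: Bool -> ThreadInfo -> Action
schedule inCrisis t
  | inCrisis && role t == Affected = SuspendThread
  | otherwise                      = KeepRunning
```

A neutral thread that starts allocating R in P simply changes its
allocatesR flag, and the same rule suspends it.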

How can a manager thread resolve the crisis?
Mainly by terminating threads (it can start with those that can be
restarted without starting the whole computation from scratch),
perhaps restarting them later, in a less resource-hungry mode.
Note that suspended threads in CH are interruptible by default, so
that even threads blocking on resource R in P can be terminated by
throwing them an appropriate exception. Another option is to shift
resources at CH level (such as emptying input MVars or filling
output MVars) in order to suspend or restrain resource hungry
threads until later.
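
As a small illustration of the first option, here is a sketch using
throwTo from today's Control.Concurrent. The ResourceLow exception and the
terminateBlocked function are made up for this example, and an empty MVar
merely stands in for "blocking on resource R in P".

```haskell
import Control.Concurrent (forkIO, throwTo)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (Exception, handle)

-- Hypothetical exception a manager throws at a resource-hungry thread.
data ResourceLow = ResourceLow deriving Show
instance Exception ResourceLow

-- Fork a worker that blocks on an MVar nobody fills, then interrupt it.
-- The result is the worker's last report.
terminateBlocked :: IO String
terminateBlocked = do
  input  <- newEmptyMVar :: IO (MVar Int)
  ready  <- newEmptyMVar
  report <- newEmptyMVar
  worker <- forkIO $
    handle (\ResourceLow -> putMVar report "terminated") $ do
      putMVar ready ()       -- the handler is installed from here on
      n <- takeMVar input    -- blocks: input is never filled
      putMVar report ("got " ++ show n)
  takeMVar ready             -- wait until the worker can catch ResourceLow
  throwTo worker ResourceLow -- blocked threads are interruptible
  takeMVar report
```

The ready MVar is only there to avoid throwing before the worker has
installed its handler; without it the exception could kill the worker
silently and the manager would wait forever for a report.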

When is later?
Manager threads need a way to find out when their job is done!
This could be a `resource available'-exception, but an explicit
status enquiry is probably easier to handle.

The implementation might be simpler than it looks at first sight:
the RTS knows which resources are low, but it also needs to know
which threads are aware of this (keeping a list of threads that have
received the corresponding exception). Then, during a crisis,
it can grant allocation requests from crisis-aware threads and
suspend all others (giving priority to managers without even
knowing about the concept). When everything is blocked, the
RTS has to act anyway, and resource pools are given implicitly
by "running in the same RTS instance". Status enquiries for
resources seem to be new, but unproblematic, and would also
allow for preventive actions by all threads to avoid fire fighting.
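
Here is a minimal sketch of that bookkeeping as a pure model, assuming a
single resource counted in units and plain Ints as thread ids. The record
Rts and the functions inCrisis, request and release are hypothetical; they
are not taken from any actual runtime system.

```haskell
type Tid = Int

-- Hypothetical RTS state for one resource in one pool.
data Rts = Rts
  { freeUnits :: Int    -- units of R still available
  , lowMark   :: Int    -- soft limit: at or below this, R is "low"
  , aware     :: [Tid]  -- threads that have received the exception
  , suspended :: [Tid]  -- threads parked until the crisis is over
  } deriving (Eq, Show)

-- The status enquiry: is the resource currently low?
inCrisis :: Rts -> Bool
inCrisis r = freeUnits r <= lowMark r

-- An allocation request for n units. During a crisis only crisis-aware
-- threads are served; all others are suspended. The Bool says whether
-- the request was granted.
request :: Tid -> Int -> Rts -> (Bool, Rts)
request tid n r
  | inCrisis r && tid `notElem` aware r =
      (False, r { suspended = suspended r ++ [tid] })
  | n > freeUnits r = (False, r)  -- nothing left to hand out
  | otherwise       = (True, r { freeUnits = freeUnits r - n })

-- Units are returned. If that ends the crisis, all suspended threads
-- are woken (returned to the caller) without further action.
release :: Int -> Rts -> ([Tid], Rts)
release n r
  | inCrisis r' = ([], r')
  | otherwise   = (suspended r', r' { suspended = [] })
  where r' = r { freeUnits = freeUnits r + n }
```

The model never mentions managers: a thread is served during the crisis
simply because it is in the aware list, which is how priority falls out
without the RTS knowing the concept.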

One nice thing about this scheme is that many threads can be kept
free of resource exception handling (simplifying their code) while
also being kept from adding to the crisis (by suspend/wake up)
[this is especially important as CH doesn't support resume after
exceptions (*)]. The managers don't have to worry about or deal
with non-manager threads (avoiding race conditions), they only
have to coordinate among themselves, further reducing the code
complexity of exception handling.

Another nice feature: it doesn't really matter (to the runtime system)
which threads receive the exception (those that don't will be
suspended before they can cause additional harm). So it is solely
in the interest of the thread system to nominate some managers
(by opening them for the exception in question, or registering,
as you suggested) and no threads need to receive the exception
by default.

This seems to cover the issues raised by Marcin (race conditions,
simplified handling when that is good enough, but able to handle
both local and distributed problem settings, not interrupting threads
when other means are available, reserving scarce resources for
handlers, terminating threads only when absolutely necessary).

Does this make sense? What have I missed?

Claus

PS One potential problem: resource R2 runs low while a
        shortage of R1 is being dealt with. If there are separate
        managers for both resources, the scheme sketched
        above should work, but what if only one thread is
        nominated to deal with all exceptions?

(*) Perhaps you do already have the mechanisms in place
    that would allow threads to resume after exception
    handling? Here is a naive idea:

    Currently, an asynchronous exception inserts a throw at
    the front of the receiving thread. The throw consumes
    thread actions up to the next matching exception handler,
    making resume impossible.

    Alternative: an asynchronous exception copies the
    receiving thread (this copying could be done lazily and
    might, or might not, be expensive, depending on your
    implementation). The copy is suspended indefinitely,
    the original is dealt with almost as before, but the thread
    Id of the copy is added to the exception. Now, the
    exception handler in the original has the choice to
    terminate the copy and continue, or to wake up the
    copy and terminate itself.


