I think we have two pretty deep problems in the way microcode primitives work. Both fortunately and unfortunately, they manifest very seldom in practice: fortunately because, well, most of the time everything's OK; unfortunately because it makes them all the more mysterious to diagnose and difficult to debug.
Problem 1 is that if a primitive is interrupted in a system call, it may either wedge Scheme or quietly have bad effects such as leaking memory, repeating side effects, &c. Problem 2 is that primitives that allocate on the heap may send Scheme into an infinite loop because the bookkeeping of how much space is needed is wrong. (And they might also leak memory, repeat side effects, &c.) I would like to fix these problems, but it's not straightforward. Every fix I have come up with entails non-trivial changes throughout the microcode. Please let me know if you have any thoughts on the subject. Apologies for the long message. * Problem 1: Interrupted primitives In order to be responsive to ^G and to let threads make progress, the system relies on handling interrupts in Scheme soon after the operating system delivers them -- e.g., when you type ^G at the tty to deliver SIGINT, or when the thread timer goes off and the OS delivers SIGALRM. If a signal arrives during a system call, the system call will return EINTR. Until 2009, most primitives making system calls would simply restart the system call on EINTR. This is wrong because it makes Scheme unresponsive and doesn't let other threads run. For example, if you tried to open a fifo, or to work with a remote file system when the network was down, your whole Scheme would freeze until someone opened the other end of the fifo, or until the network stack timed out. In 2009 (commit 6cf614a), I changed most primitives so that if a system call fails with EINTR, and there is a pending interrupt to handle in Scheme, Scheme will back out of the primitive, handle the interrupt, and then restart the primitive. (If there is no pending interrupt, the primitive will just restart the system call as before.) But this is wrong too because it is not safe to back out and restart many primitives at this point -- they have already had permanent side effects that cannot be rolled back. I have two ideas for policies to solve this problem. (a) If a primitive performs a permanent side effect, it may not use any system call afterward that could ever conceivably block. This means eliminating (or at least deprecating) complex primitives such as FILE-TOUCH and replacing them by Scheme code. (b) PCLSR: After each permanent side effect, a primitive must mark its activation record so that if restarted after backing out, it will pick up where it left off. This might entail using bind_interpreter_state after each effect and unbind_interpreter_state before the primitive returns, or it might entail replacing the primitive apply frame on the stack. This may require essentially rewriting a number of primitives in continuation-passing style. * Problem 2: Heap allocation bookkeeping in primitives Whenever a primitive wants to allocate n words on the heap, it checks whether n words are available. If not, it requests n words from the garbage collector and backs out to let the garbage collector run. If the garbage collector frees n words, Scheme restarts the primitive; otherwise, Scheme aborts to the nearest REPL and tells you that it's out of memory. I believe most primitives in the system back out for GC only when it is actually safe to do so -- the only exceptions I know of arise from the use of primitives as an FFI, such as prpgsql.c, which should be rewritten anyway. However, if a primitive allocates n words, and then m words, in separate steps, the following may happen: 1. Suppose n + (m - k) words are free, where 0 < k <= min(n, m). 2. The primitive allocates n words, so now m - k are free. 3. The primitive tries to allocate m words, but only m - k are free. 4. The primitive asks the garbage collector to make m words free. 5. The garbage collector frees the n words the primitive allocated. 6. Since n + (m - k) >= m words are now free, the primitive restarts. 6. Return to step 1: Scheme is now wedged. I have three ideas for policies to solve this problem. (a) Every primitive must count up front how many words it plans to allocate. Scheme48 does this, and has a little abstraction for the accounting (ENSURE-SPACE and allocation keys) that can either verify the accounting at run-time or be compiled away to zero overhead. This requires going through every primitive that allocates, and calculating an upper bound on how much it may allocate. (b) For every primitive invocation, the microcode must keep a record of the number of words the primitive has allocated. If there is too little space free for some allocation, then the garbage collector must make free the total number of words the requested so far by the primitive, not just the number requested in that allocation, before restarting the primitive. This means a possibly significant run-time overhead for the bookkeeping, but may be doable without changing many primitives, by remembering Free on entry to each primitive in order to find how much the primitive has advanced Free so far. (c) PCLSR: After each allocation, a primitive must mark its activation record so that the garbage collector won't free what it has just allocated, and if restarted after backing out, it will pick up where it left off. The linker in cmpint.c does this, and it's very hairy. P.S. Compiled code also has a problem with allocation... Every entry to a procedure checks whether Free < heap_alloc_limit, and branches to an interrupt if not. What it needs to check is whether Free + n < heap_alloc_limit, where n is the number of words it allocates in-line. The stack checks have the same problem. Usually this is all OK because most procedures won't allocate more space than there is in the buffer between heap_alloc_limit and heap_end, but `usually' is no substitute for a guarantee. _______________________________________________ MIT-Scheme-devel mailing list MIT-Scheme-devel@gnu.org https://lists.gnu.org/mailman/listinfo/mit-scheme-devel