While working on a new project written in Lisp for ECL, I ran into some stability issues. After investigating them and running some tests, the project now seems to be much more stable. Ideally this should eventually be documented better, and officially, but I thought I'd share my notes for now.
If someone can adapt some of this into actual documentation, they're welcome to. Other than the wiki, the current documentation seems to be DocBook, which I personally have no experience with yet.

- One issue involved unresponsive busy loops with ENOMEM errors encountered by ECL and libgc. NetBSD (among other multiuser systems) implements soft/current and hard/maximum limits on various resources. These are configurable via login classes (login.conf) and, per process, via sysctl (or setrlimit(2)). Once the soft datasize limit is reached, allocations fail with ENOMEM, and as an option the process might raise its limit until the maximum limit is reached (it seems that neither ECL nor libgc does this automatically, though, and they won't adapt their own limits to the current rlimits).

  The default ECL heap size used with Boehm-GC seems to be 1GB (configurable using the --heap-size option or the EXT:SET-LIMIT function). If that size is reached, ECL signals a condition with the option to grow the heap. If the OS-specific limit is reached before the limit configured in ECL, disastrous consequences can arise: ECL loops endlessly, attempting to signal a condition but getting ENOMEM errors while doing so. This could possibly be worked around by preallocating some memory for that very situation. However, ensuring that the ECL heap limit is always smaller than the OS-set limit prevents the situation altogether.

  Similar problems might occur with other limits, such as the stack or file descriptors. I suspect that reaching the fd limit is less critical than reaching the stack or heap limit. ECL's stack limit should also be smaller than the OS's stacksize soft limit. These default limits are set in src/c/main.d.

- Another problem I recently faced had to do with Boehm-GC (libgc) deadlocking when it needs to collect or grow (it then goes through the routine GC_collect_or_expand() -> GC_try_to_collect_inner() -> GC_stopped_mark() -> GC_stop_world()).
  To stop the world and perform a garbage collection run, it uses OS-specific code. On several POSIX systems it uses pthread_kill() to interrupt every thread with a signal, causing each of them to invoke the GC_suspend_handler() signal handler. Those threads then wait for another signal, or on a lock/condvar, to resume after collection. Unfortunately that part is complex and error-prone, and considering all the OS specifics, libgc may be more stable and better tested on some systems than on others. The ECL-supplied gc (7.1.9 if I remember correctly) is slower on NetBSD than the one I had compiled from pkgsrc (7.2), but it took me longer to reproduce the issue with it.

  Stdio FILE handles have an internal buffer state, which uses mutexes internally on NetBSD when a process is linked with libpthread. It appears that libgc has concurrency issues when it attempts to collect while stdio is heavily used. I could not reproduce that issue on Linux+glibc yet, but I assume libgc is also better tested there. I wrote a minimal test to reproduce the issue, and it indeed had to do with threaded libgc+stdio.

  I then modified my application to use :CSTREAM NIL when opening its output FIFO file, and the application uptime was noticeably better. There were however two remaining issues. Only one character at a time was written, even using WRITE-SEQUENCE with a LATIN-1 or PASSTHROUGH external format (meaning in this case several thousand write(2) syscalls per second, each surrounded by other syscalls related to interrupt control). Despite this I decided to start stress testing the application, and it had a decent uptime, until the same issue happened again. I then noticed a spurious fflush(3) call, which might be that of eformat_write*(). To solve both issues and be able to move forward with testing, I wrote a small WRITE-SEQUENCE replacement using C-INLINE, as I had done for Crow-HTTPd.
  Performance improved dramatically (I use custom BASE-CHAR vectors as buffers and large direct write(2) calls to the descriptor), and so far it has been stable (although it's still being stress tested). To mitigate this, ECL could be made to avoid stdio and use its own buffering streams on top of file descriptors. However, this would not solve the issue where libraries used through FFI need to be supplied a stdio FILE handle. ECL currently uses unbuffered file descriptors for stdin/stdout/stderr when compiled with threads, possibly partly for that reason, but the comment also mentions blocking. With some care not to use stdio in the application itself either, stability seems dramatically increased. This has not been verified yet, but I suspect that the occasional locking issues I observe with C-c C-c during live interactive development might also be related to stdio usage.

- A previously discovered issue, found when writing Crow-HTTPd, was also related to libgc+threads race conditions, but at thread termination. It seems that the Mono runtime was also affected by that on Solaris. For simplicity I had set up Crow to avoid shrinking the thread pool, although there might have been other solutions. Crow then became very stable. It still runs my site and has uptimes as long as the server's (occasionally interrupted for security software updates).

- Probably also worth mentioning is that ECL itself avoids synchronizing every access to potentially shared user objects, other than where necessary, as for packages. This means that the user is responsible for providing explicit synchronization, using MP primitives, for concurrently accessed objects, including hash tables and instance objects. This also has to be considered for interactive development, where the REPL might be used to alter the state of live objects. Ideally a single access library should be written which provides the synchronization, such that both the software and the REPL user go through it.
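As a minimal sketch of such an access library (assuming an ECL built with threads; *REGISTRY*, *REGISTRY-LOCK*, REGISTRY-GET, REGISTRY-PUT and REGISTRY-COUNT are hypothetical names made up for illustration, not something from my application), every access to a shared hash table could go through functions that take the same lock:

```lisp
;;; Hypothetical names throughout; only MP:MAKE-LOCK and MP:WITH-LOCK
;;; are ECL's own threading primitives (requires a threaded ECL build).

(defvar *registry* (make-hash-table :test #'equal)
  "Shared state; not meant to be touched directly outside these functions.")

(defvar *registry-lock* (mp:make-lock :name "registry"))

(defun registry-get (key &optional default)
  ;; Readers take the lock too: an unsynchronized GETHASH may observe
  ;; the table while another thread is modifying it.
  (mp:with-lock (*registry-lock*)
    (gethash key *registry* default)))

(defun registry-put (key value)
  (mp:with-lock (*registry-lock*)
    (setf (gethash key *registry*) value)))

(defun registry-count ()
  (mp:with-lock (*registry-lock*)
    (hash-table-count *registry*)))
```

At the REPL one would then call (registry-put "answer" 42) or (registry-get "answer") rather than using GETHASH on the table directly, so that interactive changes are serialized with the running threads.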
- It is very important to take heed when ECL issues warnings about an object being of type NIL. This occurs when optimizations are in use and conflicting annotations exist for a variable. If ECL issues this warning for a vector and the user lowers SAFETY to 1 or below and raises SPEED, it might optimize access into inline C using the largest native machine word (64-bit on amd64) rather than the expected word size.

  On the other hand, if no large-scope DECLAIM TYPE annotation exists, every function may supply its own conflicting local-scope DECLARE TYPE annotation, and ECL will silently let you shoot everyone in the foot at your request (even if those functions are inlined). This can be an advantage, but it's low level enough to be dangerous. It's possible for instance to access a byte vector using byte-32 or byte-64 access with SAFETY 0, but it becomes your responsibility to ensure alignment and to avoid potential conflicts in relation to the fill-pointer and dimension. Doing this is also obviously very implementation-dependent. I tested the following, which works on ECL but fails with SBCL (other than the inline C, obviously):

      ;;; LDB didn't optimize well here, and the chain of THE FIXNUM and
      ;;; LOGAND/ASH calls was tedious
      (defun byteorder-bswap16 (word)
        (declare (optimize (speed 3) (safety 0) (debug 0))
                 (type (unsigned-byte 16) word))
        (the (unsigned-byte 16)
             #+:little-endian
             (ffi:c-inline (word) (:uint16-t) :uint16-t
               "
               uint16_t w = #0;
               @(return) = ((w & 0xff00) >> 8 | (w & 0x00ff) << 8);
               "
               :one-liner nil
               :side-effects nil)
             #-:little-endian
             word))

      (declaim (inline get-byte8))
      (defun get-byte8 (vector offset)
        (declare (optimize (speed 3) (safety 0) (debug 0))
                 (type (vector (unsigned-byte 8) *) vector)
                 (type fixnum offset))
        (the (unsigned-byte 8) (aref vector offset)))

      (declaim (inline get-byte16))
      (defun get-byte16 (vector offset)
        (declare (optimize (speed 3) (safety 0) (debug 0))
                 (type (vector (unsigned-byte 16) *) vector)
                 (type fixnum offset))
        (the (unsigned-byte 16)
             (byteorder-bswap16 (aref vector offset))))

  The test was then to supply the same byte vector to both functions. If a DECLAIM existed for that common vector, a warning about NIL type would be issued, and inline 64-bit access to these vectors would be generated. If the vector isn't 16-bit aligned (or 64-bit aligned in the case of the warning), it might cause a SIGBUS on some architectures. Thus the reminder: if you need this kind of low level optimization, it's best to also inspect the resulting C, and to only do it where necessary...

- For several reasons, when using the C compiler, it's useful to call certain functions through FUNCALL/APPLY with function symbols, instead of using direct function calls or direct function references, when those functions are likely to change a lot and be recompiled (or re-evaluated and interpreted). I.e. (FUNCALL 'FOO) vs (FUNCALL #'FOO) or (FOO). Otherwise, when debugging code or testing freshly written modifications, newly introduced bugs or fixes might not become visible immediately without recompiling the dependent code (much as when redefining a structure versus a CLOS class). There is the case of inline functions, but also of large code blocks or whole-file compilation, which might compile to direct function calls. Another possibility is to use the interpreter at that stage. For the same reasons, using classes for frequently changed structures is slower at runtime than using structures, but better for interactive development and consistent results.

- Occasionally SLIME might crash as if the Lisp image itself had crashed, but once restarted with the SLIME command, it detects a running ECL and asks if we want a new image. Answering no often resumes properly. This does not mean that the image is corrupted, but that SWANK or SLIME bugs still exist, or that the output produced by a REPL-entered form could not be handled.
  I noticed that improperly defined PRINT-OBJECT methods can also be dangerous for stability, and that it's often simpler to use a custom method or function instead, at least at the initial development stages. There are many requirements for properly integrated printing, and after all that's why PRINT-UNREADABLE-OBJECT is quite helpful... Sometimes, when SLIME can't handle a situation, using the REPL directly with the embedded debugger is still useful.

That's it for now :)
-- Matt