---------- Forwarded message ----------
From: David T. Lewis <[email protected]>
Date: Tue, Dec 7, 2010 at 2:06 AM
Subject: Re: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!
To: Squeak Virtual Machine Development Discussion <[email protected]>, [email protected]

On Mon, Dec 06, 2010 at 12:33:59PM -0800, Andreas Raab wrote:
>
> At a guess, I'd say it's one of two issues:
>
> 1) Your STOP/CONT handling. This sounds suspicious and it could affect
> the timer handling. I'm assuming that the issue happens after receiving
> the CONT signal, no? If you can, you might want to a) make sure that you
> only get the STOP signal when the VM is in ioRelinquish() and not (for
> example) currently executing the delay process and b) consider dumping
> the call stacks whenever the VM gets the CONT signal to see what the
> status is.
>
> 2) Some set of incomplete process/delay/semaphore changes in Pharo. One
> of the problems with processes and delays is that this part of the
> system reacts very badly to random "cleaning". I.e., changing "foo ==
> nil" to "foo isNil" can have dramatic effects (since it introduces a
> suspension point) with just the kind of weird issue you're seeing.

Actually #2 does seem like a likely culprit. I found a Pharo 1.1 image
and loaded the CommandShell and OSProcess test suites. The CommandShell
tests put a heavy load on process switching, and are rather timing
dependent. On Pharo 1.1 I get intermittent and non-reproducible errors
and test failures, and I can't get a clean run of the test suite. The
errors seem to be different each time.

On Pharo 1.1.1 and 1.2 I can get clean runs of the CommandShell/OSProcess
tests, so I think there must be some issues in Pharo 1.1. If you are
using PharoCore 1.1 now and have the option of moving to Pharo 1.1.1 or
1.2, I suspect you may see the problems go away.

Dave

>
> With regard to these processes not being printed, that's a side effect
> of how printAllStacks gathers the processes - it will not print
> suspended processes, which explains why the UI process doesn't print, and
> most likely handleTimerEvent is suspended in a debugger.
>
> Depending on how important this issue is you can also try to dissect the
> object memory itself. If you call writeImageFile (or is it
> writeImageFileIO?) from gdb it will dump the .image file and you can use
> the simulator to look at it more closely. Most likely you'll be able to
> find the processes and look at their stacks.
>
> Cheers,
>   - Andreas
>
> On 12/6/2010 2:55 AM, Adrian Lienhard wrote:
> >
> >Hi all,
> >
> >We've been experiencing an "interesting" problem: the image freezes and
> >does not respond to HTTP requests anymore after it has been running for
> >days.
> >
> >Here is some basic information about our setup:
> >
> >Squeak VM 4.0.3-2202 compiled with gcc 4.3.2
> >PharoCore 1.1
> >OS Debian Lenny amd64 (CPUs are 4 Intel Xeon E5530 2.40GHz)
> >
> >- We have never seen the problem with the Squeak VM 3.9-9 and Squeak 3.9
> >on the identical machine and with the same application source (modulo some
> >adaptations to make it run on Pharo).
> >- We run the VM with -mmap 512m -vm-sound-null -vm-display-null, and the
> >UI process is suspended (Project uiProcess suspend)
> >- VM does not hog the CPU and memory usage is normal
> >- The mean time between failures is several weeks and we haven't managed to
> >reproduce the problem
> >- The application mainly serves HTTP requests. When the image does not
> >receive requests for some time we send it a STOP signal; when a request
> >comes in it is sent a CONT signal.
> >- lsof shows
> >  TCP *:9093 (LISTEN)
> >  TCP server:9093->server:46930 (CLOSE_WAIT)
> >
> >Below is a GDB backtrace and the Smalltalk stacks from an image that was
> >frozen (the VM had been running for almost 100 hours):
> >
> >=============================================================
> >(gdb) bt
> >#0 0x08072020 in ?? ()
> >#1 <signal handler called>
> >#2 0xb766f5e0 in malloc () from /lib/libc.so.6
> >#3 <function called from gdb>
> >#4 0xb76c50c8 in select () from /lib/libc.so.6
> >#5 0x08071063 in aioPoll ()
> >#6 0xb778bb8d in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-null
> >#7 0x000003e8 in ?? ()
> >#8 0x997b5a34 in ?? ()
> >#9 0xbfe7cb28 in ?? ()
> >#10 0x08074575 in ioRelinquishProcessorForMicroseconds ()
> >Backtrace stopped: frame did not save the PC
> >
> >(gdb) call printCallStack()
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >$3 = -1755344892
> >
> >(gdb) call (int) printAllStacks()
> >Process
> >-1719969228>idleProcess
> >-1719969320>startUp
> >-1740134028 BlockClosure>newProcess
> >
> >Process
> >-1740113860>finalizationProcess
> >-1740113952>restartFinalizationProcess
> >-1740113532 BlockClosure>newProcess
> >
> >Process
> >-1740134424 SmalltalkImage>lowSpaceWatcher
> >-1740134516 SmalltalkImage>installLowSpaceWatcher
> >-1740134300 BlockClosure>newProcess
> >
> >Process
> >-1719451488 Delay>wait
> >-1719451580 BlockClosure>ifCurtailed:
> >-1719451704 Delay>wait
> >-1719451796 InputEventPollingFetcher>waitForInput
> >-1740126940 InputEventFetcher>eventLoop
> >-1740127032 InputEventFetcher>installEventLoop
> >-1740126816 BlockClosure>newProcess
> >
> >Process
> >-1719557780 UnixOSProcessAccessor>grimReaperProcess
> >-1740113624 BlockClosure>repeat
> >-1740113716 UnixOSProcessAccessor>grimReaperProcess
> >-1740117340 BlockClosure>newProcess
> >
> >[omitted many newlines between output above]
> >=============================================================
> >
> >What is striking from the above process listing is that two processes are
> >missing: the handleTimerEvent process and the Seaside process (that is,
> >the TCP listener loop). How come these processes vanished?
> >
> >This may be related to Pharo or to the Squeak VM.
> >
> >Has anybody else seen this problem? Any idea how to debug/fix this issue
> >is very much appreciated!
> >
> >Cheers,
> >Adrian
> >
> >
> >CCed to pharo-dev since this may be related to Pharo; please respond on
> >the squeak-vm list
> >
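
The check Adrian did by eye on the printAllStacks() listing (which processes show up, which do not) could be scripted if such dumps are collected regularly, for example on every CONT signal as Andreas suggests. Below is a minimal sketch in Python, not part of the VM or of any existing tool. The function names and the list of expected selectors are hypothetical, and the frame format is inferred only from the output quoted above. Note Andreas's caveat: printAllStacks does not print suspended processes, so a selector reported as missing means "not printed", not necessarily "terminated".

```python
import re

# Frame lines in the quoted output look like
#   -1719969228>idleProcess
#   -1740134028 BlockClosure>newProcess
# This pattern is inferred from that output only (an assumption).
FRAME_RE = re.compile(r"^(-?\d+)\s*([^>]*)>(\S+)$")

# Leading mail quote markers such as "> >" are stripped before matching.
QUOTE_RE = re.compile(r"^(?:>\s*)+")


def parse_all_stacks(text):
    """Return one list of selectors per 'Process' block in the listing."""
    processes = []
    for raw in text.splitlines():
        line = QUOTE_RE.sub("", raw.strip()).strip()
        if line == "Process":
            processes.append([])  # start a new process block
            continue
        match = FRAME_RE.match(line)
        if match and processes:
            processes[-1].append(match.group(3))  # keep the selector only
    return processes


def missing_selectors(text, expected):
    """Return the expected selectors that appear in no printed stack."""
    seen = {sel for frames in parse_all_stacks(text) for sel in frames}
    return [sel for sel in expected if sel not in seen]


if __name__ == "__main__":
    # Hypothetical expectation list; adjust to the image being inspected.
    expected = ["idleProcess", "finalizationProcess", "handleTimerEvent"]
    listing = """\
Process
-1719969228>idleProcess
-1719969320>startUp
-1740134028 BlockClosure>newProcess

Process
-1740113860>finalizationProcess
-1740113952>restartFinalizationProcess
-1740113532 BlockClosure>newProcess
"""
    print(missing_selectors(listing, expected))
```

Matching on selectors rather than on the leading numbers is deliberate: the numbers are object addresses that differ between runs, while the selectors identify what each process is doing.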
