Re: [Pharo-project] Fwd: [Vm-dev] Image freeze because handleTimerEvent and Seaside process gone?!

Adrian Lienhard Tue, 07 Dec 2010 03:39:10 -0800

The changes between 1.1 and 1.1.1 are the issues in [1]. None seems related... 
did I miss something?


One change that I don't understand, although it probably is unrelated, is in 
[2]: 

LargePositiveInteger removeSelector: #=!
LargePositiveInteger removeSelector: #bitAnd:!
LargePositiveInteger removeSelector: #bitOr:!
LargePositiveInteger removeSelector: #bitShift:!
LargePositiveInteger removeSelector: #bitXor:!
LargePositiveInteger removeSelector: #'~='!

Why would one want to remove these primitive calls from large integers?

Cheers,
Adrian

[1] 
http://code.google.com/p/pharo/issues/list?can=1&q=Milestone%3D1.1.1&colspec=ID+Type+Status+Summary+Milestone+Difficulty&cells=tiles
[2] 
http://code.google.com/p/pharo/issues/attachmentText?id=2912&aid=-2442931684430823333&name=NecessaryImageChangesForCogToWork.Pharo1.1.cs&token=4a16b7709abc303c3826e5be2743eeb7


On Dec 7, 2010, at 09:52 , Mariano Martinez Peck wrote:

> ---------- Forwarded message ----------
> From: David T. Lewis <[email protected]>
> Date: Tue, Dec 7, 2010 at 2:06 AM
> Subject: Re: [Vm-dev] Image freeze because handleTimerEvent and Seaside
> process gone?!
> To: Squeak Virtual Machine Development Discussion <
> [email protected]>, [email protected]
> 
> 
> 
> On Mon, Dec 06, 2010 at 12:33:59PM -0800, Andreas Raab wrote:
>> 
>> At a guess, I'd say it's either one of two issues:
>> 
>> 1) Your STOP/CONT handling. This sounds suspicious and it could affect
>> the timer handling. I'm assuming that the issue happens after receiving
>> the CONT signal, no? If you can, you might want to a) make sure that you
>> only get the STOP signal when the VM is in ioRelinquish() and not (for
>> example) currently executing the delay process and b) consider dumping
>> the call stacks whenever the VM gets the CONT signal to see what the
>> status is.
>> 
>> 2) Some set of incomplete process/delay/semaphore changes in Pharo. One
>> of the problems with processes and delays is that this part of the
>> system reacts very badly to random "cleaning". I.e., changing "foo ==
>> nil" to "foo isNil" can have dramatic effects (since it introduces a
>> suspension point) with just the kind of weird issue you're seeing.
> 
> Actually #2 does seem like a likely culprit. I found a Pharo 1.1 image
> and loaded the CommandShell and OSProcess test suites. The CommandShell
> tests put a heavy load on process switching, and are rather timing
> dependent. On Pharo 1.1 I get intermittent and non-reproducible errors
> and test failures, and I can't get a clean run of the test suite. The
> errors seem to be different each time.
> 
> On Pharo 1.1.1 and 1.2 I can get clean runs of the CommandShell/OSProcess
> tests, so I think there must be some issues in Pharo 1.1. If you are
> using PharoCore 1.1 now and have the option of moving to Pharo 1.1.1
> or 1.2, I suspect you may see the problems go away.
> 
> Dave
> 
> 
>> 
>> With regards to these processes not being printed, that's a side effect
>> of how printAllStacks gathers the processes - it will not print
>> suspended processes which explains why the UI process doesn't print and
>> most likely handleTimerEvent is suspended in a debugger.
>> 
>> Depending on how important this issue is you can also try to dissect the
>> object memory itself. If you call writeImageFile (or is it
>> writeImageFileIO?) from gdb it will dump the .image file and you can use
>> the simulator to look at it more closely. Most likely you'll be able to
>> find the processes and look at their stacks.
>> 
>> Cheers,
>>  - Andreas
>> 
>> On 12/6/2010 2:55 AM, Adrian Lienhard wrote:
>>> 
>>> Hi all,
>>> 
>>> We've been experiencing an "interesting" problem: the image freezes and
>>> does not respond to HTTP requests anymore after it has been running for
>>> days.
>>> 
>>> Here is some basic information about our setup:
>>> 
>>> Squeak VM 4.0.3-2202 compiled with gcc 4.3.2
>>> PharoCore 1.1
>>> OS Debian Lenny amd64 (CPUs are 4 Intel Xeon E5530 2.40GHz)
>>> 
>>> - We have never seen the problem with the Squeak VM 3.9-9 and Squeak 3.9
>>> on the identical machine and with the same application source (modulo some
>>> adaptations to make it run on Pharo).
>>> - We run the VM with -mmap 512m -vm-sound-null -vm-display-null, and the
>>> UI process is suspended (Project uiProcess suspend)
>>> - VM does not hog the CPU and memory usage is normal
>>> - The mean time between failures is several weeks and we haven't managed to
>>> reproduce the problem
>>> - The application mainly serves HTTP requests. When the image does not
>>> receive requests for some time, we send it a STOP signal; when a request
>>> comes in, it is sent a CONT signal.
>>> - lsof shows
>>>    TCP *:9093 (LISTEN)
>>>    TCP server:9093->server:46930 (CLOSE_WAIT)
>>> 
>>> Below is a GDB backtrace and the Smalltalk stacks from an image that was
>>> frozen (the VM had been running for almost 100 hours):
>>> 
>>> =============================================================
>>> (gdb) bt
>>> #0  0x08072020 in ?? ()
>>> #1  <signal handler called>
>>> #2  0xb766f5e0 in malloc () from /lib/libc.so.6
>>> #3  <function called from gdb>
>>> #4  0xb76c50c8 in select () from /lib/libc.so.6
>>> #5  0x08071063 in aioPoll ()
>>> #6  0xb778bb8d in ?? () from /usr/lib/squeak/4.0.3-2202//so.vm-display-null
>>> #7  0x000003e8 in ?? ()
>>> #8  0x997b5a34 in ?? ()
>>> #9  0xbfe7cb28 in ?? ()
>>> #10 0x08074575 in ioRelinquishProcessorForMicroseconds ()
>>> Backtrace stopped: frame did not save the PC
>>> 
>>> (gdb) call printCallStack()
>>> -1719969228>idleProcess
>>> -1719969320>startUp
>>> -1740134028 BlockClosure>newProcess
>>> $3 = -1755344892
>>> 
>>> (gdb) call (int) printAllStacks()
>>> Process
>>> -1719969228>idleProcess
>>> -1719969320>startUp
>>> -1740134028 BlockClosure>newProcess
>>> 
>>> Process
>>> -1740113860>finalizationProcess
>>> -1740113952>restartFinalizationProcess
>>> -1740113532 BlockClosure>newProcess
>>> 
>>> Process
>>> -1740134424 SmalltalkImage>lowSpaceWatcher
>>> -1740134516 SmalltalkImage>installLowSpaceWatcher
>>> -1740134300 BlockClosure>newProcess
>>> 
>>> Process
>>> -1719451488 Delay>wait
>>> -1719451580 BlockClosure>ifCurtailed:
>>> -1719451704 Delay>wait
>>> -1719451796 InputEventPollingFetcher>waitForInput
>>> -1740126940 InputEventFetcher>eventLoop
>>> -1740127032 InputEventFetcher>installEventLoop
>>> -1740126816 BlockClosure>newProcess
>>> 
>>> Process
>>> -1719557780 UnixOSProcessAccessor>grimReaperProcess
>>> -1740113624 BlockClosure>repeat
>>> -1740113716 UnixOSProcessAccessor>grimReaperProcess
>>> -1740117340 BlockClosure>newProcess
>>> 
>>> [omitted many newlines between output above]
>>> =============================================================
>>> 
>>> What is striking from the above process listing is that two processes are
>>> missing: the handleTimerEvent process and the Seaside process (that is,
>>> the TCP listener loop). How come these processes vanished?
>>> 
>>> This may be related to Pharo or to the Squeak VM.
>>> 
>>> Has anybody else seen this problem? Any ideas on how to debug or fix this
>>> issue are very much appreciated!
>>> 
>>> Cheers,
>>> Adrian
>>> 
>>> 
>>> CCed to pharo-dev since this may be related to Pharo; please respond on
>>> the squeak-vm list
>>> 
>>> 
>>>
