On Wed, Jun 03, 2015 at 07:05:15AM +0200, Thierry Goubier wrote:
> Hi Dave,
>
> Le 03/06/2015 03:15, David T. Lewis a écrit :
> >Hi Thierry and Jose,
> >
> >I am reading this thread with interest and will help if I can.
> >
> >I do have one idea that we have not tried before. I have a theory that
> >this may be an intermittent problem caused by SIGCHLD signals (from the
> >external OS process when it exits) being missed by the
> >UnixOSProcessAccessor>>grimReaperProcess that handles them.
> >
> >If this is happening, then I may be able to change grimReaperProcess to
> >work around the problem.
> >
> >When you see the OS deadlock condition, are you able to tell if your
> >Pharo VM process has subprocesses in the zombie state (indicating that
> >grimReaperProcess did not clean them up)? The unix command
> >"ps -axf | less" will let you look at the process tree, and that may
> >give us a clue if this is happening.
>
> I found it very easy to reproduce, and I do have a zombie child
> process under the pharo process.
Jose confirms this also (thanks).

Can you try filing in the attached UnixOSProcessAccessor>>grimReaperProcess
and see if it helps? I do not know if it will make a difference, but the
idea is to put a timeout on the semaphore that is waiting for signals from
SIGCHLD. I am hoping that if these signals are sometimes being missed, then
the timeout will allow the process to recover from the problem.
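
To illustrate the timeout idea in isolation, here is a minimal workspace
sketch. It is not the attached method: the variable name sem is made up for
illustration and stands in for the SIGCHLD semaphore, and the sketch assumes
that Semaphore in your image responds to #waitTimeoutMSecs: (the attached
method checks for this with respondsTo: before using it).

```smalltalk
"Hypothetical workspace sketch; sem stands in for the SIGCHLD semaphore."
| sem |
sem := Semaphore new.
[3 timesRepeat:
	[sem waitTimeoutMSecs: 1000.	"returns when signaled, or after about 1 second"
	Transcript show: 'woke up, checking child status'; cr]] fork.
sem signal.	"one signal arrives; the other wakeups should come from the timeout"
```

The point is that the loop keeps making progress even if a signal never
arrives, so a missed SIGCHLD would cost at most one timeout period rather
than blocking the watcher indefinitely.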
>
> Interestingly enough, the lock-up happens in a very specific place, a call
> to git branch, which is a very short command returning just a few
> characters (where all other commands have longer outputs). Reducing the
> frequency of the calls to git branch by a bit of caching reduces the
> chances of a lock-up.
>
This is a good clue, and it may indicate a different kind of problem (so
maybe I am looking in the wrong place). Ben's suggestion of adding a delay
to the external process sounds like a good idea to help troubleshoot it.

Dave

'From Squeak4.5 of 30 May 2015 [latest update: #15039] on 2 June 2015 at 9:35:21 pm'!

!UnixOSProcessAccessor methodsFor: 'initialize - release' stamp: 'dtl 6/2/2015 20:54'!
grimReaperProcess
	"This is a process which waits for the death of a child OSProcess, and
	informs any dependents of the change. Use SIGCHLD events if possible,
	otherwise a Delay to poll for exiting child processes."

	| eventWaiter processSynchronizationDelay |
	^ self canAccessSystem
		ifTrue:
			[eventWaiter := (self canAccessSystem and: [self canForwardExternalSignals])
				ifTrue: [self sigChldSemaphore "semaphore signaled by SIGCHLD"]
				ifFalse: [Delay forMilliseconds: 200 "simple polling loop"].
			processSynchronizationDelay := Delay forMilliseconds: 20.
			grimReaper ifNil:
				[grimReaper :=
					[[(eventWaiter respondsTo: #waitTimeoutMSecs:)
						ifTrue: [eventWaiter waitTimeoutMSecs: 1000 "semaphore with timeout"]
						ifFalse: [eventWaiter wait].
					processSynchronizationDelay wait. "Avoids lost signals in heavy process switching"
					self changed: #childProcessStatus]
						repeat] newProcess.
				grimReaper resume.
				"name selected to look reasonable in the process browser"
				grimReaper name: ((ReadStream on: grimReaper hash asString) next: 5)
					, ': the child OSProcess watcher']]
		ifFalse:
			[nil]
! !