So it finally turned out that the culprit was in my own code. I was logging exception objects whose signaler context points to the socket. That way, on every connection timeout I added the exception to a collection, which kept the socket reachable and prevented its external resources from being unregistered.
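A minimal sketch of that leak pattern (the names `errorLog` and `doRequest`, and the use of `ConnectionTimedOut`, are illustrative assumptions, not Norbert's actual code):

```smalltalk
"Hypothetical sketch: storing the exception keeps its signalerContext
alive, and through the context's receiver chain the socket itself, so
the socket is never finalized and its semaphores stay registered."
| errorLog |
errorLog := OrderedCollection new.
[ self doRequest ]
    on: ConnectionTimedOut
    do: [ :ex |
        "Leak: ex references the whole signalling context."
        errorLog add: ex ].
```

If the exception itself must be kept, logging only `ex messageText` and `ex class name` (or copying the stack to a string) avoids retaining the live context.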
Norbert

On 11.10.2013 at 15:02, Norbert Hartl <[email protected]> wrote:
>
>
> On 11.10.2013 at 10:53, Sven Van Caekenberghe <[email protected]> wrote:
>
>>
>> On 11 Oct 2013, at 10:24, Norbert Hartl <[email protected]> wrote:
>>
>>> I can report that the behavior is different now. There were two new VM
>>> releases in the PPA this week. The first one didn't work, but the second
>>> changed something. My application has never run this long: it has been up
>>> for more than a day now, with an actual external objects table size of
>>> 623, which was never reached before. So I would say there is a chance
>>> that this particular problem is gone. I'll monitor this further; I don't
>>> think this was the only problem, but then that is another problem.
>>
>> Yeah, but not knowing your application load, 623, which would be about 200
>> sockets (3 semaphores per socket), is still a lot to be active at the same
>> time. Can you in some way invoke a full GC externally, for example using
>> ZnReadEvalPrintDelegate, and see if it eventually drops due to
>> finalization? It should, at least that is what I see.
>>
> Yes, that's what I meant. There is always only one outgoing connection at a
> time. Every 15 seconds one request is issued. So you see why I expect to
> find more.
> I'm travelling right now and will have a deeper look once I'm back.
>
> Norbert
>>> Thanks to all of you who've helped solve this. When the VM is the source
>>> of problems it is always extra annoying, because it is much harder to
>>> change something there.
>>>
>>> Norbert
>>>
>>>
>>> On 08.10.2013 at 11:27, Igor Stasenko <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On 7 October 2013 18:36, Norbert Hartl <[email protected]> wrote:
>>>>
>>>> On 07.10.2013 at 16:36, Igor Stasenko <[email protected]> wrote:
>>>>
>>>>> One thing:
>>>>>
>>>>> Can you tell me what the following expression yields for your VM/image:
>>>>>
>>>>> Smalltalk vm maxExternalSemaphores
>>>>>
>>>>> (If it gives you a number less than 10000000, then I think I know what
>>>>> your problem is :)
>>>> It is 10000000
>>>>
>>>> What would be the problem if it were smaller?
>>>>
>>>>
>>>> That just means your VM doesn't have an external object size cap.
>>>> I changed the implementation to not have a hard limit (the arbitrarily
>>>> large number is there just to be "compatible" with the previous
>>>> implementation).
>>>>
>>>> This means that you can actually change the check in your image to
>>>> completely ignore limits and just keep growing when necessary.
>>>>
>>>> Now, since you are using a VM which doesn't have a limit, but the
>>>> problem still persists, it seems like it is somewhere else... :/
>>>>> I just found that after one merge my changes got lost.
>>>>> We just plugged them back in, and they should be back again with newer
>>>>> VMs... but the problem could be more than just semaphores. If the merge
>>>>> broke this, it may have broken many other things, so we need time to
>>>>> check.
>>>> I'll try to look at it some more. I'm using the pharo-vm from the
>>>> Launchpad build. Are the changes supposed to be in this one?
>>>>
>>>> Norbert
>>>>
>>>> Launchpad? You mean the PPA? I can't say I remember all the details of
>>>> how changes to the VM source get into the PPA distro, and how fast they
>>>> get there. @Damien, can you enlighten us?
>>>>
>>>>
>>>> Well, the VM which I downloaded recently using the zero-conf script has
>>>> the limit back at 256. Just a merge mistake, which is now fixed... which
>>>> means that a couple of builds will use the limit-based implementation,
>>>> but then it will be back to my implementation.
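[On VMs that still enforce the 256-entry cap, the workaround discussed above is to grow the table once at startup, before load, using the silent variant mentioned in the method's own warning text. A sketch; the receiver `Smalltalk vm` and the size 4096 are assumptions to adapt to your image:]

```smalltalk
"Grow the semaphore signal handling table once at startup instead of
letting it resize (and possibly lose signals) under load. The value
4096 is illustrative; pick one above your expected peak usage."
Smalltalk vm maxExternalSemaphores < 4096
    ifTrue: [ Smalltalk vm maxExternalSemaphoresSilently: 4096 ].
```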
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7 October 2013 12:31, Norbert Hartl <[email protected]> wrote:
>>>>>
>>>>> On 07.10.2013 at 11:28, Henrik Johansen
>>>>> <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Oct 7, 2013, at 11:16, Norbert Hartl <[email protected]> wrote:
>>>>>>
>>>>>>> As I need an image that runs longer than 24 hours, I'm looking at
>>>>>>> some things and wondering. Can anybody explain to me the rationale
>>>>>>> for code like this:
>>>>>>>
>>>>>>> maxExternalSemaphores: aSize
>>>>>>>     "This method should never be called as a result of normal program
>>>>>>>     execution. If it is, however, handle it differently:
>>>>>>>     - In development, signal an error to prompt the user to set a
>>>>>>>       bigger size at startup immediately.
>>>>>>>     - In production, accept the cost of potentially unhandled
>>>>>>>       interrupts, but log the action for later review.
>>>>>>>
>>>>>>>     See the comment in maxExternalObjectsSilently: for why this
>>>>>>>     behaviour is desirable."
>>>>>>>     "Can't find a place where development/production is decided.
>>>>>>>     Suggest Smalltalk image inProduction, but use an overridable temp
>>>>>>>     meanwhile."
>>>>>>>     | inProduction |
>>>>>>>     self maxExternalSemaphores ifNil: [ ^ 0 ].
>>>>>>>     inProduction := false.
>>>>>>>     ^ inProduction
>>>>>>>         ifTrue: [
>>>>>>>             self maxExternalSemaphoresSilently: aSize.
>>>>>>>             self
>>>>>>>                 crTrace: 'WARNING: Had to increase size of semaphore signal handling table due to many external objects concurrently in use';
>>>>>>>                 crTrace: 'You should increase this size at startup using #maxExternalObjectsSilently:';
>>>>>>>                 crTrace: 'Current table size: ', self maxExternalSemaphores printString ]
>>>>>>>         ifFalse: [ "Smalltalk image"
>>>>>>>             self error: 'Not enough space for external objects, set a larger size at startup!'
>>>>>>>             "Smalltalk image" ]
>>>>>>>
>>>>>>> I have reported this once but got no feedback, so I would like to
>>>>>>> hear a few opinions.
>>>>>>>
>>>>>>> The report is here: https://pharo.fogbugz.com/f/cases/10839/
>>>>>>>
>>>>>>> Norbert
>>>>>>
>>>>>> The rationale is that inProduction would be some global setting, not
>>>>>> yet in place when the code was written…
>>>>>> Excessive simultaneous semaphore usage is something that should be
>>>>>> caught during development, in which case it's better to get an active
>>>>>> notification than to have it logged somewhere.
>>>>>
>>>>> Agreed. But that didn't work in my case, because it took roughly 20
>>>>> hours and an unstable remote backend to trigger the problem. And
>>>>> somehow I forgot to install my logger as the Transcript, so there was
>>>>> no warning message; I saw only dead images in the morning.
>>>>> This is not satisfactory, but on the other hand this type of problem is
>>>>> hard to solve anyway. My feeling tells me there is more to discover.
>>>>> Socket resources get unregistered at finalization time, but this didn't
>>>>> work either. I would have said that the unlikely case that no garbage
>>>>> collection ran could be the cause, but it can't be, because
>>>>> ExternalSemaphoreTable>>#freedSlotsIn:ratherThanIncreaseSizeTo: does an
>>>>> explicit garbage collection.
>>>>>
>>>>>> If I've understood correctly, it's moot on newer Pharo VMs, where
>>>>>> there is no limit on the semaphore table size. But for legacy code, a
>>>>>> startup item that sets the size using maxExternalObjectsSilently: (as
>>>>>> suggested in the warning text) is still a more proper fix than setting
>>>>>> inProduction to true and crossing your fingers hoping no signals will
>>>>>> be lost during table growth.
>>>>>
>>>>> Ah, I didn't know about the risk of losing signals while resizing the
>>>>> table. Thanks for that. Don't get me wrong, I wasn't proposing to set
>>>>> inProduction in effect. I don't think that automatically growing
>>>>> resource management is a proper way to design a system. There is always
>>>>> a range of resources you need for your use case. Not setting an upper
>>>>> bound for this just covers up leaking behavior.
>>>>>
>>>>> Norbert
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Igor Stasenko.
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Igor Stasenko.
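[For readers following along: the monitoring loop discussed in this thread can be tried from a workspace. A sketch only; `Smalltalk vm maxExternalSemaphores` is quoted above, while `Smalltalk externalObjects` is the Squeak-lineage accessor for the registration table and may differ by Pharo version:]

```smalltalk
"Report the limit and current occupancy of the external objects table."
Transcript
    crShow: 'Limit: ', Smalltalk vm maxExternalSemaphores printString;
    crShow: 'In use: ',
        (Smalltalk externalObjects count: [ :each | each notNil ]) printString.

"Force a full GC so finalization runs, as Sven suggests, then re-check:
if the in-use count does not drop back, something (like the retained
exceptions found later) is still referencing the sockets."
Smalltalk garbageCollect.
```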
