[Pharo-dev] Re: [Pharo-vm] Stack overflow support?

stephane ducasse Wed, 15 Nov 2023 01:49:02 -0800

Thanks.
We will digest and discuss internally. 

S


> On 14 Nov 2023, at 07:19, Daniel Slomovits <[email protected]> wrote:
> 
> Hello Stephane,
> 
> A simple example like that—really any case caused by common types of mistake, 
> even if the method involved is more complex—wouldn't get to run for a couple 
> seconds. It would error out after a small fraction of a second, since the 
> interrupt is issued immediately upon exceeding the limit (which by default 
> is—oops, 64k slots, 256kB, at most ~10k stack frames, but realistically less 
> due to args and temps), and it just doesn't take that long to execute most 
> recursive methods 10k levels deep.
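
A rough check of the frame counts in this thread, as a sketch only. It assumes the six-slots-per-frame overhead quoted further down, and the three extra slots for args and temps are an illustrative guess rather than a measured figure. `max_frames` is a hypothetical helper, not part of any VM.

```python
KB = 1024

def max_frames(stack_bytes, slot_bytes, extra_slots=0, overhead_slots=6):
    """Upper bound on the frames that fit in a stack of stack_bytes.

    overhead_slots=6 is the per-frame figure quoted in this thread;
    extra_slots stands in for args and temps (an assumption).
    """
    total_slots = stack_bytes // slot_bytes
    return total_slots // (overhead_slots + extra_slots)

# 32-bit VM: 256 kB of 4-byte slots is 64k slots.
print(max_frames(256 * KB, slot_bytes=4))                  # 10922
print(max_frames(256 * KB, slot_bytes=4, extra_slots=3))   # 7281

# 64-bit VM: 36 kB of 8-byte slots.
print(max_frames(36 * KB, slot_bytes=8))                   # 768
print(max_frames(36 * KB, slot_bytes=8, extra_slots=3))    # 512
```

With no args or temps this gives the "~10k frames" ceiling for the 256 kB case, and roughly 500 to 750 frames for 36 kB on 64-bit.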
> 
> At this point it would raise the limit a little for that process and signal a 
> StackOverflow exception. In production this likely would mean killing the 
> process but that's up to application code. In development you get a walkback 
> like any other exception. Because the stack is kept to a relatively 
> reasonable size, I've never had debugger performance be an issue—it's 
> noticeably more sluggish, but we're talking about taking 200-300ms to respond 
> instead of imperceptible. Similarly for GC pressure—even if you have dozens 
> of processes all maxed out on stack, this would still be less than the memory 
> used to start a base image.
> 
> Certainly there are scenarios that can cause problems—an error while printing 
> an object will stop you from opening a debugger, yes. However a StackOverflow 
> in particular isn't any worse than any other exception—in all cases you'll 
> get a walkback, hit "Debug", and get another walkback instead of a debugger. 
> If you hit "Terminate" at that point, it might leave the original process in 
> a zombie state and/or a hidden debugger window, but you can kill it from the 
> Process Monitor/close it from the Window menu and be fine. And even this is 
> because Dolphin doesn't safeguard printing in the development tools the way 
> Pharo does—if Pharo adopted stack overflow handling like Dolphin's, a stack 
> overflow likely wouldn't even stop you from opening a debugger, it would just 
> make it a little slow, then something would show up with "error printing: a 
> StackOverflow" or the like.
> 
> Hope that helps.
> 
> Daniel
> 
> On Sun, Nov 12, 2023 at 4:53 AM stephane ducasse <[email protected]> wrote:
>> Hi Daniel
>> 
>> Thanks for the feedback.
>> Maybe you wrote it, but I could not really understand.
>> 
>> How does Dolphin handle
>> 
>>>> ```
>>>> A >> foo
>>>>    ^ self foo
>>>> ```
>> 
>> Is it left to run for a couple of seconds?
>> Does it kill the process?
>> 
>> In Pharo we do have an interrupt, but:
>> 
>>>> But it could happen that,
>>>>  - the stack is so big that the debugger is very sluggish (best-case 
>>>> scenario)
>>>>  - the VM is just flooded doing GCs so maybe the Ctrl dot event does not 
>>>> even arrive at Pharo or the trigger
>>>>  - if the recursion is hit when printing an object (which is more common 
>>>> than you could imagine), opening the debugger could trigger a new 
>>>> recursion and never give back the control to the user
>> 
>> 
>> S
>> 
>>> On 11 Nov 2023, at 05:10, Daniel Slomovits <[email protected]> wrote:
>>> 
>>> I think this is a great idea. I've mostly used Dolphin Smalltalk, which is 
>>> actually a strict stack machine under the hood (it has a context-like 
>>> introspection API but the stack is explicitly the canonical form), so it's 
>>> more-or-less forced to implement a limit of some kind. When I started using 
>>> Pharo I triggered a couple stack overflows by mistake, and was frustrated 
>>> by the fact that at first what happened was...nothing, everything seemed 
>>> fine, my code just didn't work. And then half a minute later Pharo gets 
>>> extremely slow and I notice it's using 2GB of memory and by then it's too 
>>> late and I have to kill the image. Getting a more-or-less immediate error 
>>> would be much more user-friendly IMO.
>>> 
>>> A couple things to learn from Dolphin's implementation, I think:
>>> When a stack overflow is detected, the resulting interrupt raises that 
>>> process' stack limit by a significant amount (though by less than the 
>>> original limit—IOW it doesn't double, but it's not just a couple more 
>>> frames either) before signaling the exception, precisely so that exception 
>>> handling can occur without triggering another stack-overflow event. A 
>>> further refinement could be that if a second stack overflow is detected, we 
>>> directly invoke more basic recovery—this could mean an emergency evaluator, 
>>> terminating the offending process and opening a post-mortem with a textual 
>>> stack dump (ugh! but at least it's predictable), etc.
>>> I don't think we should worry too much about refining what exactly the 
>>> limit is. 10x as much stack as 99% of code will ever use, is still a tiny 
>>> amount compared to consuming all available memory with Contexts. At least, 
>>> if I'm understanding the graph/data correctly. That's 36kB of stack space, 
>>> right? Not 36k frames/contexts deep? With each context being six slots plus 
>>> args/temps, 36kB is 500-750 frames on a 64-bit VM (in stack 
>>> representation—contexts add object-header overhead but we don't reify them 
>>> unless we have to). For reference, Dolphin's limit is 64kB, but that's a 
>>> 32-bit VM, so the equivalent for 64-bit would be 128kB...but because Pharo 
>>> can spill stack frames to the heap as contexts, the limit could easily be 1MB, or a fixed 
>>> number of frames designed to approximate that—still a tiny amount of memory 
>>> overall, and still will be hit near-instantly by true infinite recursion, 
>>> but lots of breathing room for most use cases.
>>> Actually, this did get me to thinking...the stack depth of a Pharo process 
>>> is not necessarily easy/cheap to compute in the general case, without 
>>> caching a lot of information on intermediate contexts. In most cases the 
>>> context chain acts as a proper stack, but not always—methods like 
>>> Process>>on:do:, and some even more esoteric ones I forget off the top of 
>>> my head, make modifications far away from the top context and may splice 
>>> context chains together in odd ways. Perhaps a more flexible limit would be 
>>> better—one that is triggered by allocating more than a certain number of 
>>> contexts total, and examines running Processes in detail to find the 
>>> culprit at that point.
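
A toy model of the two-stage scheme described above, as a sketch under stated assumptions: the first overflow raises the limit by some headroom (less than the original limit) and signals a recoverable exception; a second overflow escalates to basic recovery. The class names, the numbers, and the one-unit-per-frame accounting are all hypothetical and are not Dolphin's or Pharo's actual implementation.

```python
class StackOverflow(Exception):
    """First overflow: recoverable, handlers get extra headroom."""

class FatalStackOverflow(Exception):
    """Second overflow: fall back to basic recovery (e.g. a stack dump)."""

class ProcessStack:
    """Tracks the depth of one process against a soft limit."""

    def __init__(self, limit, headroom):
        self.limit = limit
        self.headroom = headroom      # chosen smaller than the original limit
        self.depth = 0
        self.overflowed = False

    def push_frame(self):
        self.depth += 1
        if self.depth > self.limit:
            if self.overflowed:
                raise FatalStackOverflow(self.depth)
            # Raise the limit first so exception handling has room to run.
            self.overflowed = True
            self.limit += self.headroom
            raise StackOverflow(self.depth)

def demo():
    stack = ProcessStack(limit=100, headroom=50)
    try:
        while True:                   # runaway recursion
            stack.push_frame()
    except StackOverflow:
        for _ in range(40):           # the handler needs stack too
            stack.push_frame()
        handled_depth = stack.depth
    try:
        for _ in range(20):           # a handler that itself runs away
            stack.push_frame()
    except FatalStackOverflow:
        return handled_depth, stack.depth

print(demo())   # (141, 151)
```

The handler runs 40 frames deeper than the overflow point without tripping the check again; only when the raised limit is also exceeded does the model give up.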
>>> 
>>> On Thu, Nov 9, 2023 at 4:38 AM Guillermo Polito <[email protected]> wrote:
>>>> Hi all,
>>>> 
>>>> We started (with many interruptions over the last months) working a bit 
>>>> with Stephane on understanding the (positive and negative) impact of 
>>>> stack-overflow support in Pharo.
>>>> The key idea is that if a process consumes too much stack (potentially 
>>>> because of an infinite recursion) then the process should stop with an 
>>>> exception.
>>>> 
>>>> ## Why we want better stack consumption control
>>>> 
>>>> This idea came up as a way to solve issues that are pretty common and that 
>>>> especially affect newcomers.
>>>> For example, imagine you accidentally write an accessor such as
>>>> 
>>>> ```
>>>> A >> foo
>>>>    ^ self foo
>>>> ```
>>>> 
>>>> Students do this all the time, and I’ve also seen it in experienced people 
>>>> who go too fast :).
>>>> More importantly, such recursion can also arise through not-so-obvious 
>>>> indirect cycles (a sends b, b sends c, c sends a), and these could hit 
>>>> anybody.
>>>> 
>>>> This is aggravated because the current execution model allows us to have 
>>>> infinite stacks —meaning: limited by available memory only.
>>>> This is indeed a nice feature for many use cases, but it has its own 
>>>> drawbacks when one of these kinds of recursion is hit:
>>>>  - code just loops forever taking space in the stack
>>>>  - when there is no more stack space, context objects are created and 
>>>> moved to the heap
>>>>  - but those contexts are strongly held, so they are never GCed and take 
>>>> up extra space
>>>>  - even worse! they are there adding more work to the GC every time and 
>>>> making the GC run more often looking for space that is not there
>>>> 
>>>> ## Why Ctrl-dot does not always work
>>>> 
>>>> Of course, super users know there is this “Ctrl dot” hidden feature that 
>>>> should help you recover from this.
>>>> First, let's set aside the fact that only super users know about it.
>>>> Now, in this situation, when Ctrl-dot is hit it will trigger a handler 
>>>> that suspends the problematic process and opens a debugger on it.
>>>> But it could happen that,
>>>>  - the stack is so big that the debugger is very sluggish (best-case 
>>>> scenario)
>>>>  - the VM is just flooded doing GCs so maybe the Ctrl dot event does not 
>>>> even arrive at Pharo or the trigger
>>>>  - if the recursion is hit when printing an object (which is more common 
>>>> than you could imagine), opening the debugger could trigger a new 
>>>> recursion and never give back the control to the user
>>>> 
>>>> ## What we are working on
>>>> 
>>>> The main idea here is: Can we have a simple and efficient way to prevent 
>>>> such kinds of situations?
>>>> 
>>>> After many discussions around detecting recursion, we kinda arrived at the 
>>>> simple solution of just detecting a stack overflow.
>>>> The solution is easy to understand (because it’s how other languages 
>>>> work) and easy to implement because there is already support for that.
>>>> But this leaves open two questions:
>>>>  - what happens when people want to use the “infinite stack” feature?
>>>>  - when should a process stack overflow? What is a sensible default value?
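
For comparison, here is how a language with a bounded call stack behaves on the indirect cycle mentioned earlier (a calls b, b calls c, c calls a). This is only an analogy in Python, using its real `sys.setrecursionlimit` and `RecursionError`; it says nothing about how the Pharo VM implements its check, and `make_cycle` and `run_with_limit` are hypothetical helpers.

```python
import sys

def make_cycle():
    """Three mutually recursive functions (a -> b -> c -> a) and a call counter."""
    count = [0]

    def a():
        count[0] += 1
        return b()

    def b():
        count[0] += 1
        return c()

    def c():
        count[0] += 1
        return a()

    return a, count

def run_with_limit(entry, limit):
    """Run entry() under a temporary recursion limit; True if the limit stopped it."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        entry()
        return False
    except RecursionError:
        return True
    finally:
        sys.setrecursionlimit(old)   # always restore the previous limit

a, count = make_cycle()
stopped = run_with_limit(a, limit=500)
print("stopped by the limit:", stopped, "after", count[0], "calls")
```

The runaway cycle ends with an ordinary exception after a few hundred calls, and the program carries on; the limit itself is a tunable parameter, which is the same question as the default value raised above.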
>>>> 
>>>> Our draft implementation here 
>>>> https://github.com/pharo-project/pharo-vm/pull/710 does the following to 
>>>> cope with this:
>>>>  - we can now parametrize the size of the stack (of each stack page to be 
>>>> more accurate) when the VM starts up
>>>>  - the stack overflow check can be disabled per process
>>>> 
>>>> We are also running experiments to see what could be a sensible stack 
>>>> size for our normal usage. Here, for example, we ran almost all test 
>>>> cases in Pharo separately (one suite per line below), and we observed how 
>>>> many tests broke (x-axis) with different stack sizes (y-axis).
>>>> Here we see that most test suites require at least 20-24k to run properly, 
>>>> some go up to 36k of stack before converging (i.e., the number of broken 
>>>> tests does not change).
>>>> 
>>>> <ImagenPegada-10.tiff>
>>>> 
>>>> You’ll notice in the graph that there are some scenarios that break all 
>>>> the time. This is because exception handling itself is recursive and may 
>>>> produce more stack overflows depending on the size of the stack between 
>>>> the exception and the exception handler.
>>>> So some more work is still required, mostly changing Pharo libraries to 
>>>> properly support this. For example:
>>>>  - should tests run in a fresh process with a fresh stack?
>>>>  - should the exception mechanism use less recursion?
>>>>  - resumable exceptions add stack pressure because they do not “unstack” 
>>>> until the exception is finally handled, meaning that the stack used by 
>>>> exception handling is simply added on top of the stack of the original 
>>>> code. Can we do better here?
>>>> 
>>>> Probably there are more interesting questions here; that’s the “why” 
>>>> behind this email.
>>>> I’m interested in opinions and scenarios you may come up with that should 
>>>> be taken into account.
>>>> 
>>>> Cheers,
>>>> Guille
>>> _______________________________________________
>>> Pharo-vm mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>> 
>> --------------------------------------------
>> Stéphane Ducasse
>> http://stephane.ducasse.free.fr / http://www.pharo.org
>> 03 59 35 87 52
>> Assistant: Aurore Dalle 
>> FAX 03 59 57 78 50
>> TEL 03 59 35 86 16
>> S. Ducasse - Inria
>> 40, avenue Halley, 
>> Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
>> Villeneuve d'Ascq 59650
>> France
>> 
>> 
>> 

--------------------------------------------
Stéphane Ducasse
http://stephane.ducasse.free.fr / http://www.pharo.org 
03 59 35 87 52
Assistant: Aurore Dalle 
FAX 03 59 57 78 50
TEL 03 59 35 86 16
S. Ducasse - Inria
40, avenue Halley, 
Parc Scientifique de la Haute Borne, Bât.A, Park Plaza
Villeneuve d'Ascq 59650
France
