Re: [drlvm] stress.Mix / MegaSpawn threading bug

Geir Magnusson Jr. Thu, 11 Jan 2007 04:29:09 -0800


On Jan 10, 2007, at 10:42 PM, Rana Dasgupta wrote:

On 1/10/07, Geir Magnusson Jr. <[EMAIL PROTECTED]> wrote:



On Jan 10, 2007, at 2:13 PM, Weldon Washburn wrote:

>> 1)
>> In some earlier posting, it was mentioned that somehow the virtual
>> memory address space is impacted by how much physical memory is in a
>> given computer.  Actually this is not true.  The virtual address space
>> available to the JVM is fixed by the OS.  A machine with less phys mem
>> will do more disk I/O.  In other words "C" malloc() hard limits are
>> set by OS version number not by RAM chips.
>>

>Talking about VM vs RAM vs whatever is a red herring - we may be
>ported to a machine w/o virtual memory.  What matters is that when
>malloc() returns null, we do something smart.  At least, do nothing
>harmful.



There can be no machine without virtual memory on any of the OS's of
interest to us.

Who do you mean by "us"? I'm not included in that "us". Sure, the most popular and probable ports of Harmony will be to environments with virtual memory, but you never know what people will want to do with it.

Anyway, dealing with resource limits is simply good programming practice, even if you have virtual memory :)
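
As a rough illustration of "do something smart when malloc() returns null", here is a minimal sketch in C. The names `vm_status`, `thread_block`, `thread_block_init` and `thread_block_destroy` are hypothetical and invented for this example; they are not DRLVM code. The only point is that the allocation site reports failure to its caller instead of dereferencing a null pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical status codes, invented for this sketch. */
typedef enum { VM_OK = 0, VM_OUT_OF_MEMORY = 1 } vm_status;

/* Hypothetical per-thread bookkeeping record. */
typedef struct {
    void  *buffer;
    size_t size;
} thread_block;

/*
 * Allocate `size` bytes (size should be nonzero) for a thread block.
 * On failure the record is left empty and an error code is returned,
 * so the caller can raise an OOME or back off instead of crashing.
 */
vm_status thread_block_init(thread_block *blk, size_t size)
{
    assert(blk != NULL);
    blk->buffer = NULL;
    blk->size = 0;

    void *p = malloc(size);
    if (p == NULL) {
        return VM_OUT_OF_MEMORY;   /* report, do not dereference */
    }
    blk->buffer = p;
    blk->size = size;
    return VM_OK;
}

/* Release the block; safe to call on a record whose init failed. */
void thread_block_destroy(thread_block *blk)
{
    assert(blk != NULL);
    free(blk->buffer);             /* free(NULL) is a no-op */
    blk->buffer = NULL;
    blk->size = 0;
}
```

A caller would check the returned status and turn `VM_OUT_OF_MEMORY` into whatever the VM's policy is (for example an OutOfMemoryError), which is the "at least do nothing harmful" part.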

VM is not a type of memory technology. What Weldon, Gregory
and several others have pointed out is that if one keeps on consuming
virtual address space by allocating space for thread stacks, the address
space will eventually run out, and the process will be in a fatal state
independent of how much physical memory is on the machine.

I don't think that anyone is arguing with this point. But from a coding POV, it's utterly irrelevant. We seem to have a bit of sloppiness deep down in DRLVM, and we need to fix it. End of story.

>> 2)
>> Why not simply hard code DRLVM to throw an OOME whenever there are
>> more than 1K threads running?  I think Rana first suggested this
>> approach.  My guess is that 1K threads is good enough to run lots of
>> interesting workloads.  My guess is that common versions of WinXP and
>> Linux will handle the C malloc() load of 1K threads successfully.  If
>> not, how about trying 512 threads?

>Because this is picking up the rug, and sweeping all the dirt
>underneath it.  The core problem isn't that we try too many threads,
>but the code wasn't written defensively. Putting an artificial limit
>on # of threads just means that we'll hit it somewhere else, in some
>other resource usage.

>I think we should fix it.

Sure. The way to fix a fatal error is to leave room for a process to recover from it or handle it. Another example of a fatal error is a stack overflow or a TerminateProcess signal. In the case of stack overflow, we handle it by trying to raise the exception while some room is left on the stack, so that there is a fair chance to handle it. Similarly, an approach could be to set a limit on the maximum number of threads we create. Based on the memory we give each thread stack, we can choose a limit which we estimate will leave us room to handle the error.
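
The limit described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `max_threads_for`, `thread_budget`, `thread_budget_acquire` and `thread_budget_release` are hypothetical names, the reserve size is an arbitrary input, and the counter is not thread-safe (a real VM would need a lock or atomics around it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are illustrative assumptions, not DRLVM code. */

/*
 * Largest number of thread stacks that fit in the usable address space
 * while leaving `reserve_bytes` untouched for error handling.
 */
uint64_t max_threads_for(uint64_t address_space_bytes,
                         uint64_t reserve_bytes,
                         uint64_t stack_bytes)
{
    if (stack_bytes == 0 || reserve_bytes >= address_space_bytes) {
        return 0;
    }
    return (address_space_bytes - reserve_bytes) / stack_bytes;
}

/* Simple live-thread counter checked before each thread creation. */
typedef struct {
    uint64_t live;
    uint64_t limit;
} thread_budget;

/* Returns 1 if a thread may be created, 0 if the caller should raise OOME. */
int thread_budget_acquire(thread_budget *b)
{
    assert(b != NULL);
    if (b->live >= b->limit) {
        return 0;
    }
    b->live++;
    return 1;
}

/* Called when a thread exits, freeing one slot in the budget. */
void thread_budget_release(thread_budget *b)
{
    assert(b != NULL);
    if (b->live > 0) {
        b->live--;
    }
}
```

With purely illustrative numbers (a 2 GiB address space, a 256 MiB reserve and 1 MiB stacks), `max_threads_for` yields 1792, so the 1793rd creation attempt would be refused while there is still memory left to construct and deliver the error.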


Ok, so we do actually agree.

>> There seem to be some basic things we can do, like reduce the stack
>> size on windows from the terabyte or whatever it is now, to the
>> number that our dear, esteemed colleague from IBM claims is perfectly
>> suitable for production use.

>That too doesn't solve the problem, but it certainly fixes a problem
>we are now aware of - our stack size is too big.... :)

The best size to set for the thread stack is a valid issue, and it is useful information to know what the IBM VM sets. Google searches also seem to show that thread stack size on J9 is user configurable. But even with smaller stack sizes, if one ran MegaSpawn for a sufficiently long time, we would get the same error. So either we can't have unbounded stresses like this, or the VM needs to bound the resources consumable by such a test. Also, we cannot just emulate what the IBM VM does in one specific area without understanding their entire design. For example, a small stack size will cause stack overflow exceptions to happen early. We need to tune these sizes based on our own experiments.

Yep

>> 3)
>> The above does not deal with the general architecture question of
>> handling C malloc failures.  This is far harder to solve.  Note that
>> solving the big question will also require far more extensive
>> regression tests than MegaSpawn.  However, it does fix DRLVM so that
>> it does not crash/burn on threads overload.  This, in turn, gives us
>> time to fix the real underlying problem(s) with C malloc.

I think that we should defer this part. It is a difficult problem, and there are several potential approaches based on what kind of reliable computing contracts we want to expose. For example, one can think of a contract that no fatal failures (OOME, stack overflow, thread abort) happen in marked regions of code, ever. I don't think that we need to solve this hard problem right now.

I disagree. I'd prefer that we at least investigate why we handle this situation so badly, and at least get to a point where the situation is handled consistently.


geir
