At Mon, 23 May 2011 22:26:20 -0400, Neil Van Dyke wrote:
> Matthew Flatt wrote at 05/23/2011 10:11 PM:
> > At Mon, 23 May 2011 22:01:31 -0400, Neil Van Dyke wrote:
> >
> >> We're not explicitly setting any stack limits anywhere. I believe but
> >> am not certain that that core dump came from a "mzscheme -jqr" from
> >> inside an Apache CGI context that got a native stack ulimit of 8192 kB
> >> (the normal limit on that machine). Shall I confirm this?
> >>
> >
> > Maybe, but I've become more interested in the possibility that other OS
> > threads might have crashed. Does `info threads' work in gdb with a core
> > file?
> >
>
> I'm not certain "gdb" is accurate here, but I don't think that any C
> code we use introduces any additional OS threads.
>
> #0 0x00000000005655b6 in GC_clear_stack_inner (arg=0x0,
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243
> 243 ./misc.c: No such file or directory.
> in ./misc.c
> (gdb) info threads
> 2 process 28526 0x00007fff316fcbe1 in nanosleep () from /lib/libc.so.6
> * 1 process 28525 0x00000000005655b6 in GC_clear_stack_inner (arg=0x0,
> limit=0x7fff2dd5ce30 <Address 0x7fff2dd5ce30 out of bounds>) at ./misc.c:243

That looks right. The nanosleep() thread is there to trigger a
Racket-thread switch every 100ms or so, but it's apparently not crashing
in the attempt.

> >> Could code evaluated at module load time, such as "make-standard-set"
> >> (which has some non-tail calls in loops, I don't know the size), be
> >> using lots of stack, and, once every 100,000 runs of a large program,
> >> combines with nondeterministic GC behavior and a bug to cause a seg fault?
> >>
> >
> > It seems unlikely that any module is using lots of C stack relative to
> > 8MB, so I think we must be missing something simpler. Nondeterministic
> > GC behavior seems like a likely part of the puzzle, though.
> >
>
> (I'm not sure whether we're talking about a Scheme stack that is
> different than the native stack) Could we be having an overly large
> stack quite often, and the rareness of the crash is only because usually
> the stack does not collide with non-stack memory in a detectable way?

Neither the C stack nor the Scheme stack (yes, they are separate) seems
particularly large. There's one overflow of the Scheme stack, but that's
not surprising since it starts small and grows on demand.

I guess we're back to checking on the stack size. Maybe also disassemble
GC_clear_stack_inner() so we can be clear on what part of the function
is crashing?

Thanks,
Matthew
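
P.S. One possible way to check the stack size, as a minimal sketch only:
resource limits are inherited from the parent process, so the check has
to run from the same Apache CGI context as the failing "mzscheme -jqr"
run. The script below assumes a Unix machine with Python available to
CGI, which is not part of the setup described in this thread, and the
helper names (format_limit, stack_limits) are made up for illustration.
It reports the native (C) stack limit in kB, the same unit that
"ulimit -s" uses.

```python
import resource


def format_limit(value):
    """Render an rlimit value (in bytes) in kB, the unit `ulimit -s` reports."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} kB"


def stack_limits():
    """Return the formatted (soft, hard) native stack limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    # Minimal CGI response: one header, a blank line, then a plain-text body.
    print("Content-Type: text/plain")
    print()
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}")
    print(f"stack hard limit: {hard}")
```

If the soft limit printed from the CGI context differs from the 8192 kB
seen in an interactive shell, that would be worth knowing before digging
further into GC_clear_stack_inner().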

