Hi Ola,
It is me again. I am emailing just to have a record of my possibly fruitless
findings... actually, by the time I finished I had resolved it for myself!!!! So it
might indeed be informative!
So, from the beginning:
it hung again and I am trying to debug it once again.
The backtrace is
(gdb) bt
#0 0x00002ae840181ee2 in __libc_fork () from /usr/lib/debug/libc.so.6
#1 0x000000000043cd90 in Popen ()
#2 0x000000000043e884 in LoadAuthorization ()
#3 0x000000000043ea76 in CheckAuthorization ()
#4 0x0000000000439a25 in ClientAuthorized ()
#5 0x000000000041e396 in ProcEstablishConnection ()
#6 0x0000000000424672 in Dispatch ()
#7 0x000000000040b145 in main ()
though it is weird, since it hung right in the middle of working and I didn't
try to authenticate (maybe someone else did???)...
In the .log file I have the debug line inserted by us:
Popen called with command='cat /home/yoh/.Xauthority' type='r' as arguments.
and the mtime of the log file is right around the point when it hung, so I guess
it is the right one.
The symbols pointed to ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c
I have downloaded the sources and unpacked them, but that fork.c is pretty much
just an #include of ../fork.c (and I also had to ln -s sysv to sysdeps), and gdb
is too silly (or I am) to look there...
so now I get
(gdb) l
Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c has
31 lines
when I go to that fork.c manually into __libc_fork I see 2 possible
causes for infinite loops:
while ((runp = __fork_handlers) != NULL)
  {
    unsigned int oldval = runp->refcntr;
    if (oldval == 0)
      /* This means some other thread removed the list just after
         the pointer has been loaded. Try again. Either the list
         is empty or we can retry it. */
      continue;
    /* Bump the reference counter. */
    if (atomic_compare_and_exchange_bool_acq (&__fork_handlers->refcntr,
                                              oldval + 1, oldval))
      /* The value changed, try again. */
      continue;
1. so if oldval stays 0, I am doomed
2. atomic_compare_and_exchange_bool_acq ... not sure
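On point 2, here is a minimal sketch of what such a compare-and-exchange retry does, written with C11 atomics rather than the glibc macro. The struct and the name take_reference are made up for illustration, and unlike the glibc loop (which reloads the list pointer when the counter is 0) this version simply gives up. It assumes the glibc macro follows its usual contract: a non-zero result means the value changed underneath and nothing was stored, which is why the quoted code does `continue` in that case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fork handler list; only the counter matters. */
struct handler_list {
    atomic_uint refcntr;
};

/* Try to take one reference on the list.
   Returns true once the counter was bumped, false if the counter is 0
   (meaning the list was already released by somebody else). */
bool take_reference(struct handler_list *list)
{
    unsigned int oldval;

    assert(list != NULL);
    oldval = atomic_load_explicit(&list->refcntr, memory_order_relaxed);
    while (oldval != 0) {
        /* Stores oldval + 1 only if the counter still equals oldval.
           On failure oldval is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_strong_explicit(&list->refcntr, &oldval,
                                                    oldval + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
            return true;
    }
    return false;
}
```

The exchange can only fail if some other thread modified refcntr between the load and the exchange, so with nothing else touching the counter this loop finishes on its first pass.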
Unfortunately I can't print any of the variables (like oldval).
nexti goes through SmartScheduleTimer(), which I had no clue
about.... ha -- actually it is from vncserver!
Then it escapes:
0x000000000043c545 in SmartScheduleTimer ()
0x00002ae84011f110 in __restore_rt () from /usr/lib/debug/libc.so.6
0x00002ae84011f117 in __restore_rt () from /usr/lib/debug/libc.so.6
0x00002ae840181ee0 in __libc_fork () from /usr/lib/debug/libc.so.6
unfortunately I am still out of luck in printing anything
(gdb) l
Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c has
31 lines.
(gdb) p oldval
No symbol "oldval" in current context.
ok -- if I go in full through the 'loop' with nexti I get
So I guess I am out of luck on the first condition (although who knows what
tricks the optimizer played on me)... ok, and here is the source of that
SmartScheduleTimer:
void
SmartScheduleTimer (int sig)
{
    int olderrno = errno;

    SmartScheduleTime += SmartScheduleInterval;
    if (SmartScheduleIdle)
    {
        SmartScheduleStopTimer ();
    }
    errno = olderrno;
}
I wonder how it interacts with WaitForSomething. That beast is
filled with #ifdefs, so it is barely comprehensible, but it is the only place
which could trigger SmartScheduleIdle (or maybe I missed some other), and
I am not sure how scheduling and switching are done, so it is not clear to me
how it could ever be reset.
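For orientation, a sketch of what stopping such a timer could look like. It assumes the periodic SIGALRM seen in the strace output quoted further down comes from a setitimer(ITIMER_REAL) interval timer; the actual body of SmartScheduleStopTimer is not shown in this thread, so the three helper names here are hypothetical and this is only a guess at the mechanism.

```c
#define _XOPEN_SOURCE 700   /* expose setitimer/getitimer under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: arm ITIMER_REAL to deliver SIGALRM every `usec`
   microseconds. Returns 0 on success, -1 on error. */
int start_interval_timer(long usec)
{
    struct itimerval t;

    assert(usec > 0);
    t.it_interval.tv_sec = usec / 1000000;
    t.it_interval.tv_usec = usec % 1000000;
    t.it_value = t.it_interval;          /* first expiry after one interval */
    return setitimer(ITIMER_REAL, &t, NULL);
}

/* Hypothetical helper: disarm the timer. A zero it_value disables it. */
int stop_interval_timer(void)
{
    struct itimerval t;

    t.it_interval.tv_sec = 0;
    t.it_interval.tv_usec = 0;
    t.it_value.tv_sec = 0;
    t.it_value.tv_usec = 0;
    return setitimer(ITIMER_REAL, &t, NULL);
}

/* Returns 1 if ITIMER_REAL is currently armed, 0 if not, -1 on error. */
int interval_timer_is_armed(void)
{
    struct itimerval cur;

    if (getitimer(ITIMER_REAL, &cur) == -1)
        return -1;
    return cur.it_value.tv_sec != 0 || cur.it_value.tv_usec != 0;
}
```

Once the timer is disarmed no further SIGALRM is generated, so whatever code the signal kept interrupting gets a chance to run to completion.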
And my knowledge and brain are somewhat far from comprehending
sysdeps/unix/sysv/linux/x86_64/sigaction.c and __restore_rt,
but ok -- let's see a bit more:
SmartScheduleTimer
0x000000000043c52c <SmartScheduleTimer+44>: test %esi,%esi
0x000000000043c52e <SmartScheduleTimer+46>: je 0x43c535 <SmartScheduleTimer+53>
0x000000000043c530 <SmartScheduleTimer+48>: callq 0x43c3f0 <SmartScheduleStopTimer>
0x000000000043c535 <SmartScheduleTimer+53>: mov %ebp,(%rbx)
0x000000000043c537 <SmartScheduleTimer+55>: mov 0x8(%rsp),%rbx
and we jump over the call at 0x43c530:
0x000000000043c52e in SmartScheduleTimer ()
0x000000000043c535 in SmartScheduleTimer ()
so for sure we are not calling SmartScheduleStopTimer ;)
let's do it manually:
(gdb) call SmartScheduleStopTimer
+call SmartScheduleStopTimer
$1 = {<text variable, no debug info>} 0x43c3f0 <SmartScheduleStopTimer>
(gdb) call SmartScheduleStopTimer()
+call SmartScheduleStopTimer()
Reading in symbols for ../sysdeps/x86_64/elf/start.S...done.
$2 = 0
(gdb) nexti
+nexti
Detaching after fork from child process 23194.
0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
ha -- some effect... let's see
(gdb) c
+c
Continuing.
Program received signal SIGPIPE, Broken pipe.
0x00002ae8401ac3e2 in __write_nocancel () from /usr/lib/debug/libc.so.6
(gdb) c
+c
Continuing.
but we are still on the hook -- 100% CPU, and stuck in the same fashion when I
press Ctrl-C.
ok -- doing the same call to SmartScheduleStopTimer and then
stepping, which might be informative:
(gdb) call SmartScheduleStopTimer()
+call SmartScheduleStopTimer()
$3 = 0
(gdb) nexti
+nexti
Detaching after fork from child process 23893.
0x00002ae840181ee8 in __libc_fork () from /usr/lib/debug/libc.so.6
(gdb) n
+n
Single stepping until exit from function __libc_fork,
which has no line number information.
Reading in symbols for genops.c...done.
Reading in symbols for malloc.c...done.
0x000000000043cd90 in Popen ()
(gdb) n
+n
Single stepping until exit from function Popen,
which has no line number information.
0x000000000043e884 in LoadAuthorization ()
(gdb) l
+l
Line number 32 out of range; ../nptl/sysdeps/unix/sysv/linux/x86_64/fork.c has
31 lines.
(gdb) n
+n
Single stepping until exit from function LoadAuthorization,
which has no line number information.
0x000000000043ea76 in CheckAuthorization ()
(gdb)
+n
Single stepping until exit from function CheckAuthorization,
which has no line number information.
0x0000000000439a25 in ClientAuthorized ()
(gdb)
+n
Single stepping until exit from function ClientAuthorized,
which has no line number information.
0x000000000041e396 in ProcEstablishConnection ()
(gdb)
+n
Single stepping until exit from function ProcEstablishConnection,
which has no line number information.
0x000000000041e0d0 in SendConnSetup ()
(gdb)
+n
Single stepping until exit from function SendConnSetup,
which has no line number information.
0x0000000000424672 in Dispatch ()
(gdb)
+n
Single stepping until exit from function Dispatch,
which has no line number information.
BOY -- now my VNC is reacting!!!! sluggish but working... let's try to detach
Quit
(gdb) detach
+detach
Detaching from program: /usr/bin/Xvnc4, process 2394
and I am again in the working VNC!!!! uff ;-))))
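The strace output quoted below shows clone() returning ERESTARTNOINTR, then SIGALRM, then rt_sigreturn, over and over, and stopping the timer by hand let fork() finish. One possible way to avoid that from code, sketched under the assumption that the periodic SIGALRM is what keeps restarting clone(), is to block SIGALRM around fork(). The names fork_with_sigalrm_blocked and demo_fork_under_timer are hypothetical, nothing from Xvnc or libc, and this is not a claim about how it is or should be fixed upstream.

```c
#define _XOPEN_SOURCE 700   /* expose sigaction/setitimer under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_ticks;

static void on_alarm(int sig)
{
    (void)sig;
    alarm_ticks++;
}

/* fork() with SIGALRM blocked, so a fast interval timer cannot interrupt
   the underlying clone(). Returns whatever fork() returned; parent and
   child both get their previous signal mask back. */
pid_t fork_with_sigalrm_blocked(void)
{
    sigset_t block, saved;
    pid_t pid;
    int saved_errno;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &saved) == -1)
        return -1;

    pid = fork();
    saved_errno = errno;                  /* keep fork's errno intact */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = saved_errno;
    return pid;
}

/* Demo: arm a 1 ms SIGALRM timer, fork, let the child exit with exit_code.
   Returns the child's exit status as seen by the parent, or -1 on error. */
int demo_fork_under_timer(int exit_code)
{
    struct sigaction sa, old_sa;
    struct itimerval fast, off;
    int status = 0, result = -1;
    pid_t pid, w;

    assert(exit_code >= 0 && exit_code <= 255);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_RESTART;             /* restart slow syscalls like waitpid */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) == -1)
        return -1;

    memset(&fast, 0, sizeof fast);
    fast.it_interval.tv_usec = 1000;
    fast.it_value.tv_usec = 1000;
    memset(&off, 0, sizeof off);
    if (setitimer(ITIMER_REAL, &fast, NULL) == -1) {
        sigaction(SIGALRM, &old_sa, NULL);
        return -1;
    }

    pid = fork_with_sigalrm_blocked();
    if (pid == 0)
        _exit(exit_code);                 /* child: leave immediately */
    if (pid > 0) {
        do {
            w = waitpid(pid, &status, 0);
        } while (w == -1 && errno == EINTR);
        if (w == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    setitimer(ITIMER_REAL, &off, NULL);   /* disarm before restoring the handler */
    sigaction(SIGALRM, &old_sa, NULL);
    return result;
}
```

A SIGALRM that arrives while it is blocked stays pending and is delivered once the old mask is restored, so the tick itself is not lost, it is just delayed until after fork() returns.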
On Mon, 28 Apr 2008, Ola Lundqvist wrote:
> Hi again
> On Mon, Apr 28, 2008 at 03:28:06PM -0400, Yaroslav Halchenko wrote:
> > > I'm not perfectly sure, but what I suspect is the problem is that the
> > > number of open files, open sockets, number of processes or something
> > > similar has reached its limit.
> > > The reason is that you get ERESTARTNOINTR.
> > thanks for sharing the knowledge ;-) I guess I just need to figure out
> > how to monitor all the resources from a single point...
> ::)
> > > Have you seen this on several systems or just one?
> > unfortunately I use VNC primarily on that one box, so I haven't seen it
> > anywhere else. If only we could figure out the loop where it gets to
> > 100%, I guess I could figure out what rejection it gets (i.e. what
> > resource is the problem)
> To me it seems more like you have a really problematic libc or kernel, because
> I see from the information you have provided that you can get this
> problem in quite a few situations.
> Are you sure that you do not have a broken installation, like a buggy kernel
> or libc?
> I mean, it should not really hang in fork...
> Best regards,
> // Ola
> > > Best regards,
> > > // Ola
> > > > Sorry for being so anal... stalled once again today. From gdb now it is
> > > > at fork and
> > > > never actually exits it :-/ If someone could build it with
> > > > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > (gdb) bt
> > > > #0 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > #1 0x000000000043cd90 in Popen ()
> > > > #2 0x000000000043e884 in LoadAuthorization ()
> > > > #3 0x000000000043ea76 in CheckAuthorization ()
> > > > #4 0x0000000000439a25 in ClientAuthorized ()
> > > > #5 0x000000000041e396 in ProcEstablishConnection ()
> > > > #6 0x0000000000424672 in Dispatch ()
> > > > #7 0x000000000040b145 in main ()
> > > > (gdb) finish
> > > > Run till exit from #0 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > Program received signal SIGINT, Interrupt.
> > > > 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > (gdb) bt
> > > > #0 0x00002b68df98cee2 in fork () from /lib/libc.so.6
> > > > strace was busy with
> > > > 14892 rt_sigreturn(0xe) = 56
> > > > 14892 clone(child_stack=0,
> > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > 14892 rt_sigreturn(0xe) = 56
> > > > 14892 clone(child_stack=0,
> > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > 14892 rt_sigreturn(0xe) = 56
> > > > 14892 clone(child_stack=0,
> > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > 14892 rt_sigreturn(0xe) = 56
> > > > 14892 clone(child_stack=0,
> > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > 14892 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > 14892 rt_sigreturn(0xe) = 56
> > > > 14892 clone(child_stack=0,
> > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > child_tidptr=0x2b68dfb39160) = ? ERESTARTNOINTR (To be restarted)
> > > > It would be so great if there were a vnc4server-dbg ;-)))
> > > > BTW -- last line in .log was due to our inserted debug line
> > > > Popen called with command='cat /home/yoh/.Xauthority' type='r' as
> > > > arguments
> > > > but I am not sure if that wasn't from original login moment earlier in
> > > > the morning
> > > > On Mon, 21 Apr 2008, Ola Lundqvist wrote:
> > > > > > > stracing was showing lots of gettimeofday, or whatever that syscall
> > > > > > > is. Today it was different:
> > > > > > 21162 rt_sigreturn(0xe) = 56
> > > > > > 21162 clone(child_stack=0,
> > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > > > child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > 21162 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > 21162 rt_sigreturn(0xe) = 56
> > > > > > 21162 clone(child_stack=0,
> > > > > > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > > > > > child_tidptr=0x2ad7a050a160) = ? ERESTARTNOINTR (To be restarted)
> > > > > > ...
> > > > > Hmm. To me it looks like we are out of resources...
> > > > --
> > > > Yaroslav Halchenko
> > > > Research Assistant, Psychology Department, Rutgers-Newark
> > > > Student Ph.D. @ CS Dept. NJIT
> > > > Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
> > > > 101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
> > > > WWW: http://www.linkedin.com/in/yarik
--
Yaroslav Halchenko
Research Assistant, Psychology Department, Rutgers-Newark
Student Ph.D. @ CS Dept. NJIT
Office: (973) 353-5440x263 | FWD: 82823 | Fax: (973) 353-1171
101 Warren Str, Smith Hall, Rm 4-105, Newark NJ 07102
WWW: http://www.linkedin.com/in/yarik